WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]: abort_on_unmet_fused_kernel_constraints ......... True
[default0]: accumulate_allreduce_grads_in_fp32 .............. True
[default0]: adam_beta1 ...................................... 0.9
[default0]: adam_beta2 ...................................... 0.95
[default0]: adam_eps ........................................ 1e-08
[default0]: adlr_autoresume ................................. False
[default0]: adlr_autoresume_interval ........................ 1000
[default0]: apply_query_key_layer_scaling ................... True
[default0]: apply_residual_connection_post_layernorm ........ False
[default0]: attention_dropout ............................... 0.1
[default0]: attention_softmax_in_fp32 ....................... False
[default0]: bert_binary_head ................................ True
[default0]: bert_load ....................................... None
[default0]: bf16 ............................................ True
[default0]: bias_dropout_fusion ............................. True
[default0]: bias_gelu_fusion ................................ True
[default0]: biencoder_projection_dim ........................ 0
[default0]: biencoder_shared_query_context_model ............ False
[default0]: block_data_path ................................. None
[default0]: checkpoint_activations .......................... True
[default0]: checkpoint_in_cpu ............................... False
[default0]: checkpoint_num_layers ........................... 1
[default0]: clip_grad ....................................... 1.0
[default0]: codecarbon_dir .................................. None
[default0]: consumed_train_samples .......................... 0
[default0]: consumed_train_tokens ........................... 0
[default0]: consumed_valid_samples .......................... 0
[default0]: contigious_checkpointing ........................ False
[default0]: cpu_optimizer ................................... False
[default0]: cpu_torch_adam .................................. False
[default0]: curriculum_learning ............................. False
[default0]: data_impl ....................................... mmap
[default0]: data_parallel_size .............................. 8
[default0]: data_path ....................................... None
[default0]: dataloader_type ................................. single
[default0]: DDP_impl ........................................ local
[default0]: decoder_seq_length .............................. None
[default0]: deepscale ....................................... False
[default0]: deepscale_config ................................ None
[default0]: deepspeed ....................................... True
[default0]: deepspeed_activation_checkpointing .............. True
[default0]: deepspeed_config ................................ ./ds_config.176449.json
[default0]: deepspeed_mpi ................................... False
[default0]: distribute_checkpointed_activations ............. False
[default0]: distributed_backend ............................. nccl
[default0]: embed_layernorm ................................. True
[default0]: embedding_path .................................. None
[default0]: encoder_seq_length .............................. 2048
[default0]: eod_mask_loss ................................... False
[default0]: eval_interval ................................... 1000
[default0]: eval_iters ...................................... 10
[default0]: eval_only ....................................... None
[default0]: evidence_data_path .............................. None
[default0]: exit_duration_in_mins ........................... 1190
[default0]: exit_interval ................................... None
[default0]: ffn_hidden_size ................................. 57344
[default0]: finetune ........................................ False
[default0]: fp16 ............................................ False
[default0]: fp16_lm_cross_entropy ........................... False
[default0]: fp32_residual_connection ........................ False
[default0]: gigaflos_no_embeds .............................. 0
[default0]: global_batch_size ............................... 2048
[default0]: glu_activation .................................. None
[default0]: hidden_dropout .................................. 0.1
[default0]: hidden_size ..................................... 14336
[default0]: hysteresis ...................................... 2
[default0]: ict_head_size ................................... None
[default0]: ict_load ........................................ None
[default0]: img_dim ......................................... 224
[default0]: indexer_batch_size .............................. 128
[default0]: indexer_log_interval ............................ 1000
[default0]: init_method_std ................................. 0.0048
[default0]: init_method_xavier_uniform ...................... False
[default0]: initial_loss_scale .............................. 4294967296
[default0]: kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]: kv_channels ..................................... 128
[default0]: layernorm_epsilon ............................... 1e-05
[default0]: lazy_mpu_init ................................... None
[default0]: load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]: local_rank ...................................... None
[default0]: log_batch_size_to_tensorboard ................... True
[default0]: log_interval .................................... 1
[default0]: log_learning_rate_to_tensorboard ................ True
[default0]: log_level ....................................... None
[default0]: log_level_replica ............................... None
[default0]: log_loss_scale_to_tensorboard ................... True
[default0]: log_num_zeros_in_grad ........................... False
[default0]: log_params_norm ................................. False
[default0]: log_path ........................................ None
[default0]: log_timers_to_tensorboard ....................... True
[default0]: log_validation_ppl_to_tensorboard ............... True
[default0]: loss_on_targets_only ............................ False
[default0]: loss_scale ...................................... None
[default0]: loss_scale_window ............................... 1000
[default0]: lr .............................................. 6e-05
[default0]: lr_decay_iters .................................. None
[default0]: lr_decay_samples ................................ 200000000
[default0]: lr_decay_style .................................. cosine
[default0]: lr_decay_tokens ................................. None
[default0]: lr_warmup_fraction .............................. None
[default0]: lr_warmup_iters ................................. 0
[default0]: lr_warmup_samples ............................... 183105
[default0]: make_vocab_size_divisible_by .................... 128
[default0]: mask_prob ....................................... 0.15
[default0]: masked_softmax_fusion ........................... True
[default0]: max_position_embeddings ......................... 2048
[default0]: memory_centric_tiled_linear ..................... False
[default0]: merge_file ...................................... None
[default0]: micro_batch_size ................................ 2
[default0]: min_loss_scale .................................. 1.0
[default0]: min_lr .......................................... 6e-06
[default0]: mmap_warmup ..................................... False
[default0]: no_load_optim ................................... None
[default0]: no_load_rng ..................................... None
[default0]: no_save_optim ................................... None
[default0]: no_save_rng ..................................... None
[default0]: num_attention_heads ............................. 112
[default0]: num_channels .................................... 3
[default0]: num_classes ..................................... 1000
[default0]: num_layers ...................................... 70
[default0]: num_layers_per_virtual_pipeline_stage ........... None
[default0]: num_workers ..................................... 2
[default0]: onnx_safe ....................................... None
[default0]: openai_gelu ..................................... False
[default0]: optimizer ....................................... adam
[default0]: override_lr_scheduler ........................... False
[default0]: pad_vocab_size_to ............................... 250880
[default0]: params_dtype .................................... torch.bfloat16
[default0]: partition_activations ........................... False
[default0]: patch_dim ....................................... 16
[default0]: pipeline_model_parallel_size .................... 12
[default0]: position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]: pp_partition_method ............................. type:transformer|embedding
[default0]: profile_backward ................................ False
[default0]: query_in_block_prob ............................. 0.1
[default0]: rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]: rank ............................................ 0
[default0]: remote_device ................................... none
[default0]: reset_attention_mask ............................ False
[default0]: reset_position_ids .............................. False
[default0]: retriever_report_topk_accuracies ................ []
[default0]: retriever_score_scaling ......................... False
[default0]: retriever_seq_length ............................ 256
[default0]: reweight_loss_based_on_position_frequency ....... False
[default0]: sample_rate ..................................... 1.0
[default0]: save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]: save_interval ................................... 50
[default0]: scatter_gather_tensors_in_pipeline .............. True
[default0]: scattered_embeddings ............................ False
[default0]: seed ............................................ 42
[default0]: seq_length ...................................... 2048
[default0]: sgd_momentum .................................... 0.9
[default0]: short_seq_prob .................................. 0.1
[default0]: skip_train_iteration_range ...................... None
[default0]: split ........................................... None
[default0]: split_transformers .............................. False
[default0]: synchronize_each_layer .......................... False
[default0]: tensor_model_parallel_size ...................... 4
[default0]: tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]: tensorboard_log_interval ........................ 1
[default0]: tensorboard_queue_size .......................... 5
[default0]: test_weighted_split_names ....................... ['test']
[default0]: test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]: test_weighted_split_paths_path .................. None
[default0]: test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']]
[default0]: test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]: tile_factor ..................................... 1
[default0]: titles_data_path ................................ None
[default0]: tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k
[default0]: tokenizer_type .................................. PretrainedFromHF
[default0]: train_iters ..................................... None
[default0]: train_samples ................................... 220000000
[default0]: train_tokens .................................... None
[default0]: train_weighted_split_names ...................... ['train']
[default0]: train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]: train_weighted_split_paths_path ................. None
[default0]: train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']]
[default0]: train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']]
[default0]: use_bnb_optimizer ............................... False
[default0]: use_checkpoint_lr_scheduler ..................... False
[default0]: use_contiguous_buffers_in_ddp ................... True
[default0]: use_cpu_initialization .......................... None
[default0]: use_one_sent_docs ............................... False
[default0]: use_pin_memory .................................. False
[default0]: valid_weighted_split_names ...................... ['valid']
[default0]: valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: valid_weighted_split_paths_path ................. None [default0]: valid_weighted_split_splits ..................... 
[['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']] [default0]: valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: virtual_pipeline_model_parallel_size ............ None [default0]: vocab_extra_ids ................................. 0 [default0]: vocab_file ...................................... None [default0]: weight_decay .................................... 0.1 [default0]: world_size ...................................... 384 [default0]: zero_allgather_bucket_size ...................... 0.0 [default0]: zero_contigious_gradients ....................... False [default0]: zero_reduce_bucket_size ......................... 0.0 [default0]: zero_reduce_scatter ............................. False [default0]: zero_stage ...................................... 0 [default0]:-------------------- end of arguments --------------------- [default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples. [default0]:> building PretrainedFromHF tokenizer ... [default0]: vocab file is un-used. loading tokenizer from pre-trained model [default0]:Offline mode: forcing local_files_only=True [default0]:Offline mode: forcing local_files_only=True [default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate. 
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40 [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e [default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880) [default0]:DeepSpeed general environment info: [default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch'] [default0]:torch version .................... 1.11.0+cu115 [default0]:torch cuda version ............... 11.5 [default0]:nvcc version ..................... 11.4 [default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed'] [default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates [default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5 [default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm **** [default0]:> initializing torch distributed ... [default7]:> setting tensorboard ... 
[default0]:> initializing tensor model parallel with size 4 [default0]:> initializing pipeline model parallel with size 12 [default0]:> setting random seeds to 42 ... [default0]:[2022-03-03 05:45:00,513] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42 [default0]:> compiling dataset index builder ... [default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data' [default0]:make: Nothing to be done for 'default'. [default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data' [default0]:>>> done with dataset index builder. Compilation time: 0.106 seconds [default0]:> compiling and loading fused kernels ... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module scaled_upper_triang_masked_softmax_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module scaled_masked_softmax_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module scaled_masked_softmax_cuda... 
[default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module fused_mix_prec_layer_norm_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module fused_mix_prec_layer_norm_cuda... [default0]:>>> done with compiling and loading fused kernels. Compilation time: 8.876 seconds [default0]:time to initialize megatron (seconds): 85.498 [default0]:[after megatron is initialized] datetime: 2022-03-03 05:45:09 [default0]:building GPT model ... [default0]:[2022-03-03 05:45:09,538] [INFO] [utils.py:828:see_memory_usage] Before Building Model [default0]:[2022-03-03 05:45:09,539] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB [default0]:[2022-03-03 05:45:09,539] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.16 GB, percent = 8.6% [default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None [default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 
18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, 
ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, 
ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, 
ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, 
ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, 
ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, 
ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, 
ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, 
ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383} [default0]:[2022-03-03 05:45:11,534] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding [default0]:stage=0 layers=8 [default0]: 0: _to_float16 [default0]: 1: EmbeddingPipe [default0]: 2: <lambda> [default0]: 3: ParallelTransformerLayerPipe [default0]: 4: ParallelTransformerLayerPipe [default0]: 5: ParallelTransformerLayerPipe [default0]: 6: ParallelTransformerLayerPipe [default0]: 7: ParallelTransformerLayerPipe [default0]:stage=1 layers=6 [default0]: 8: ParallelTransformerLayerPipe [default0]: 9: ParallelTransformerLayerPipe [default0]: 10: ParallelTransformerLayerPipe [default0]: 11: ParallelTransformerLayerPipe [default0]: 12: ParallelTransformerLayerPipe [default0]: 13: ParallelTransformerLayerPipe [default0]:stage=2 layers=6 [default0]: 14: ParallelTransformerLayerPipe [default0]: 15: ParallelTransformerLayerPipe [default0]: 16: ParallelTransformerLayerPipe [default0]: 17: ParallelTransformerLayerPipe [default0]: 18: ParallelTransformerLayerPipe [default0]: 19: ParallelTransformerLayerPipe [default0]:stage=3 layers=6 [default0]: 20: ParallelTransformerLayerPipe [default0]: 21: ParallelTransformerLayerPipe [default0]: 22: ParallelTransformerLayerPipe [default0]: 23: ParallelTransformerLayerPipe [default0]: 24: ParallelTransformerLayerPipe [default0]: 25: ParallelTransformerLayerPipe [default0]:stage=4 layers=6 [default0]: 26: ParallelTransformerLayerPipe [default0]: 27: ParallelTransformerLayerPipe [default0]: 28: ParallelTransformerLayerPipe [default0]: 29: ParallelTransformerLayerPipe [default0]: 30: ParallelTransformerLayerPipe [default0]: 31: ParallelTransformerLayerPipe [default0]:stage=5 layers=6 [default0]: 32: ParallelTransformerLayerPipe [default0]: 33: ParallelTransformerLayerPipe [default0]: 34: 
ParallelTransformerLayerPipe [default0]: 35: ParallelTransformerLayerPipe [default0]: 36: ParallelTransformerLayerPipe [default0]: 37: ParallelTransformerLayerPipe [default0]:stage=6 layers=6 [default0]: 38: ParallelTransformerLayerPipe [default0]: 39: ParallelTransformerLayerPipe [default0]: 40: ParallelTransformerLayerPipe [default0]: 41: ParallelTransformerLayerPipe [default0]: 42: ParallelTransformerLayerPipe [default0]: 43: ParallelTransformerLayerPipe [default0]:stage=7 layers=6 [default0]: 44: ParallelTransformerLayerPipe [default0]: 45: ParallelTransformerLayerPipe [default0]: 46: ParallelTransformerLayerPipe [default0]: 47: ParallelTransformerLayerPipe [default0]: 48: ParallelTransformerLayerPipe [default0]: 49: ParallelTransformerLayerPipe [default0]:stage=8 layers=6 [default0]: 50: ParallelTransformerLayerPipe [default0]: 51: ParallelTransformerLayerPipe [default0]: 52: ParallelTransformerLayerPipe [default0]: 53: ParallelTransformerLayerPipe [default0]: 54: ParallelTransformerLayerPipe [default0]: 55: ParallelTransformerLayerPipe [default0]:stage=9 layers=6 [default0]: 56: ParallelTransformerLayerPipe [default0]: 57: ParallelTransformerLayerPipe [default0]: 58: ParallelTransformerLayerPipe [default0]: 59: ParallelTransformerLayerPipe [default0]: 60: ParallelTransformerLayerPipe [default0]: 61: ParallelTransformerLayerPipe [default0]:stage=10 layers=6 [default0]: 62: ParallelTransformerLayerPipe [default0]: 63: ParallelTransformerLayerPipe [default0]: 64: ParallelTransformerLayerPipe [default0]: 65: ParallelTransformerLayerPipe [default0]: 66: ParallelTransformerLayerPipe [default0]: 67: ParallelTransformerLayerPipe [default0]:stage=11 layers=9 [default0]: 68: ParallelTransformerLayerPipe [default0]: 69: ParallelTransformerLayerPipe [default0]: 70: ParallelTransformerLayerPipe [default0]: 71: ParallelTransformerLayerPipe [default0]: 72: ParallelTransformerLayerPipe [default0]: 73: <lambda> [default0]: 74: MixedFusedLayerNorm [default0]: 75: EmbeddingPipe 
[default0]: 76: float16_to_fp32 [default0]: loss: CrossEntropy [default0]:[2022-03-03 05:45:12,761] [INFO] [utils.py:828:see_memory_usage] After Building Model [default0]:[2022-03-03 05:45:12,761] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-03 05:45:12,762] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.6 GB, percent = 8.7% [default0]:setting training iterations to 128728 [default0]:> learning rate decay style: cosine [default0]:DeepSpeed is enabled. [default0]:[2022-03-03 05:45:12,782] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates [default0]:[2022-03-03 05:45:14,566] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False [default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer [default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer [default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam [default0]:[2022-03-03 05:45:14,567] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer [default0]:[2022-03-03 05:45:14,602] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer [default0]:[2022-03-03 05:45:14,603] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-03 05:45:14,603] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,624] [INFO] [utils.py:828:see_memory_usage] before initializing group 0 [default0]:[2022-03-03 05:45:14,625] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.42 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-03 05:45:14,625] [INFO] [utils.py:837:see_memory_usage] CPU Virtual 
Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,675] [INFO] [utils.py:828:see_memory_usage] after initializing group 0 [default0]:[2022-03-03 05:45:14,675] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-03 05:45:14,675] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,696] [INFO] [utils.py:828:see_memory_usage] before initializing group 1 [default0]:[2022-03-03 05:45:14,696] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-03 05:45:14,696] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,738] [INFO] [utils.py:828:see_memory_usage] after initializing group 1 [default0]:[2022-03-03 05:45:14,739] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 05:45:14,739] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.96 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,760] [INFO] [utils.py:828:see_memory_usage] before initializing group 2 [default0]:[2022-03-03 05:45:14,760] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 05:45:14,760] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.96 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,782] [INFO] [utils.py:828:see_memory_usage] after initializing group 2 [default0]:[2022-03-03 05:45:14,783] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 05:45:14,783] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.96 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,804] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer [default0]:[2022-03-03 05:45:14,804] [INFO] 
[utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 05:45:14,805] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.96 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,851] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer [default0]:[2022-03-03 05:45:14,852] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-03 05:45:14,852] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.96 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,872] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer [default0]:[2022-03-03 05:45:14,873] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-03 05:45:14,873] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.96 GB, percent = 8.7% [default0]:[2022-03-03 05:45:14,873] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [default0]:[2022-03-03 05:45:14,873] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler [default0]:[2022-03-03 05:45:14,873] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x149408ee1100> [default0]:[2022-03-03 05:45:14,873] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)] [default0]:[2022-03-03 05:45:14,873] [INFO] [config.py:1057:print] DeepSpeedEngine configuration: [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] activation_checkpointing_config { [default0]: "partition_activations": false, [default0]: "contiguous_memory_optimization": false, [default0]: "cpu_checkpointing": false, [default0]: "number_checkpoints": null, [default0]: "synchronize_checkpoint_boundary": false, [default0]: "profile": false [default0]:} [default0]:[2022-03-03 05:45:14,874] 
[INFO] [config.py:1061:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] amp_enabled .................. False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] amp_params ................... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] autotuning_config ............ { [default0]: "enabled": false, [default0]: "start_step": null, [default0]: "end_step": null, [default0]: "metric_path": null, [default0]: "arg_mappings": null, [default0]: "metric": "throughput", [default0]: "model_info": null, [default0]: "results_dir": null, [default0]: "exps_dir": null, [default0]: "overwrite": true, [default0]: "fast": true, [default0]: "start_profile_step": 3, [default0]: "end_profile_step": 5, [default0]: "tuner_type": "gridsearch", [default0]: "tuner_early_stopping": 5, [default0]: "tuner_num_trials": 50, [default0]: "model_info_path": null, [default0]: "mp_size": 1, [default0]: "max_train_batch_size": null, [default0]: "min_train_batch_size": 1, [default0]: "max_train_micro_batch_size_per_gpu": 1.024000e+03, [default0]: "min_train_micro_batch_size_per_gpu": 1, [default0]: "num_tuning_micro_batch_sizes": 3 [default0]:} [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] bfloat16_enabled ............. True [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] checkpoint_tag_validation_enabled True [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] checkpoint_tag_validation_fail False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] communication_data_type ...... None [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] curriculum_enabled ........... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] curriculum_params ............ 
False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] dataloader_drop_last ......... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] disable_allgather ............ False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] dump_state ................... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] dynamic_loss_scale_args ...... None [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_enabled ........... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_gas_boundary_resolution 1 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_layer_name ........ bert.encoder.layer [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_layer_num ......... 0 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_max_iter .......... 100 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_stability ......... 1e-06 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_tol ............... 0.01 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] eigenvalue_verbose ........... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] elasticity_enabled ........... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] flops_profiler_config ........ { [default0]: "enabled": false, [default0]: "profile_step": 1, [default0]: "module_depth": -1, [default0]: "top_modules": 1, [default0]: "detailed": true, [default0]: "output_file": null [default0]:} [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] fp16_enabled ................. False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] fp16_master_weights_and_gradients False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] fp16_mixed_quantize .......... 
False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] global_rank .................. 0 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] gradient_accumulation_steps .. 128 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] gradient_clipping ............ 1.0 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] gradient_predivide_factor .... 1.0 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] initial_dynamic_scale ........ 1 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] loss_scale ................... 1.0 [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] memory_breakdown ............. False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] optimizer_legacy_fusion ...... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] optimizer_name ............... None [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] optimizer_params ............. None [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] pld_enabled .................. False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] pld_params ................... False [default0]:[2022-03-03 05:45:14,874] [INFO] [config.py:1061:print] prescale_gradients ........... False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_change_rate ......... 0.001 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_groups .............. 1 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_offset .............. 1000 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_period .............. 
1000 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_rounding ............ 0 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_start_bits .......... 16 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_target_bits ......... 8 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_training_enabled .... False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_type ................ 0 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] quantize_verbose ............. False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] scheduler_name ............... None [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] scheduler_params ............. None [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] sparse_attention ............. None [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] sparse_gradients_enabled ..... False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] steps_per_print .............. 2000 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] tensorboard_enabled .......... False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] tensorboard_job_name ......... DeepSpeedJobName [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] tensorboard_output_path ...... [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] train_batch_size ............. 2048 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] train_micro_batch_size_per_gpu 2 [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] use_quantizer_kernel ......... False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] wall_clock_breakdown ......... False [default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] world_size ................... 
8
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] zero_allow_untested_optimizer False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] zero_config .................. {
[default0]: "stage": 0,
[default0]: "contiguous_gradients": true,
[default0]: "reduce_scatter": true,
[default0]: "reduce_bucket_size": 5.000000e+08,
[default0]: "allgather_partitions": true,
[default0]: "allgather_bucket_size": 5.000000e+08,
[default0]: "overlap_comm": false,
[default0]: "load_from_fp32_weights": true,
[default0]: "elastic_checkpoint": false,
[default0]: "offload_param": null,
[default0]: "offload_optimizer": null,
[default0]: "sub_group_size": 1.000000e+09,
[default0]: "prefetch_bucket_size": 5.000000e+07,
[default0]: "param_persistence_threshold": 1.000000e+05,
[default0]: "max_live_parameters": 1.000000e+09,
[default0]: "max_reuse_distance": 1.000000e+09,
[default0]: "gather_16bit_weights_on_model_save": false,
[default0]: "ignore_unused_parameters": true,
[default0]: "round_robin_gradients": false,
[default0]: "legacy_stage1": false
[default0]:}
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] zero_enabled ................. False
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1061:print] zero_optimization_stage ...... 0
[default0]:[2022-03-03 05:45:14,875] [INFO] [config.py:1063:print] json = {
[default0]: "train_micro_batch_size_per_gpu": 2,
[default0]: "train_batch_size": 2.048000e+03,
[default0]: "gradient_clipping": 1.0,
[default0]: "zero_optimization": {
[default0]: "stage": 0
[default0]: },
[default0]: "bf16": {
[default0]: "enabled": true
[default0]: },
[default0]: "steps_per_print": 2.000000e+03,
[default0]: "wall_clock_breakdown": false
[default0]:}
[default0]:[2022-03-03 05:45:14,875] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2
[default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50)
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,693] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,693] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,692] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) 
STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 05:45:16,691] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default6]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default3]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default7]:time (ms) | load-checkpoint: 8.35
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,375] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,376] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:WARNING: could not find the metadata file /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: will not load any checkpoints and will start from random [default0]:estimated model parameters: 191.162474496 [default0]:estimated model parameters without embeddings: 148.003086336 [default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2022-03-03 05:45:17 [default0]:> building train, validation, and test datasets ... [default0]: > datasets target sizes (minimum size): [default0]: train: 220000000 [default0]: validation: 2641920 [default0]: test: 20480 [default0]:> building train, validation, and test datasets for GPT ... [default0]: > building dataset index ... [default1]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default3]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,370] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,368] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,371] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default2]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default1]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,374] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]:/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/utils.py:280: UserWarning: Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings [default0]: warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings") [default1]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default5]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default2]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default3]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default4]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default7]:[2022-03-03 05:45:17,372] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default6]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
[default0]:[2022-03-03 05:45:17,373] [WARNING] [engine.py:2482:load_checkpoint] Unable to find latest file at /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.101499 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 1211127) total of 1211127 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (388379) is smaller than 95.0% of number of samples per epoch (471556), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 2.342541 [default0]: using: [default0]: number of documents: 1211127 [default0]: number of epochs: 41 [default0]: sequence length: 2048 [default0]: total number of samples: 19333817 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.257259 [default0]: > building shuffle index with split [0, 18862261) and [18862261, 19333817) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.608458 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.013 seconds [default0]: total number of samples: 19333818 [default0]: total number of epochs: 41 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.014306 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 2104966) total of 2104966 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (190457) is smaller than 95.0% of number of samples per epoch (209202), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 2.091640 [default0]: using: [default0]: number of documents: 2104966 [default0]: number of epochs: 22 [default0]: sequence length: 2048 [default0]: total number of samples: 4602460 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.130691 [default0]: > building shuffle index with split [0, 4393257) and [4393257, 4602460) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.105553 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.015 seconds [default0]: total number of samples: 4602461 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.019053 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 13965889) total of 13965889 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (774480) is smaller than 95.0% of number of samples per epoch (8932197), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 2.399262 [default0]: using: [default0]: number of documents: 13965889 [default0]: number of epochs: 4 [default0]: sequence length: 2048 [default0]: total number of samples: 35728791 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.862422 [default0]: > building shuffle index with split [0, 26796593) and [26796593, 35728791) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 1.125560 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.015 seconds 
[default0]: total number of samples: 35728792 [default0]: total number of epochs: 4 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.059332 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 2626391) total of 2626391 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (322204) is smaller than 95.0% of number of samples per epoch (1004978), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 3.681606 [default0]: using: [default0]: number of documents: 2626391 [default0]: number of epochs: 28 [default0]: sequence length: 2048 [default0]: total number of samples: 28139392 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.438673 [default0]: > building shuffle index with split [0, 27134414) and [27134414, 28139392) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.986919 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.016 seconds [default0]: total number of samples: 28139393 [default0]: total number of epochs: 28 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.013899 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 746147) total of 746147 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (2279) is smaller than 95.0% of number of samples per epoch (30472), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.569235 [default0]: using: [default0]: number of documents: 746147 [default0]: number of epochs: 22 [default0]: sequence length: 2048 [default0]: total number of samples: 670403 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.032689 [default0]: > building shuffle index with split [0, 639930) and [639930, 670403) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.015091 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.010 seconds [default0]: total number of samples: 670404 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.013124 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 1659380) total of 1659380 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (118198) is smaller than 95.0% of number of samples per epoch (499143), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 4.923787 [default0]: using: [default0]: number of documents: 1659380 [default0]: number of epochs: 56 [default0]: sequence length: 2048 [default0]: total number of samples: 27952019 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.411989 [default0]: > building shuffle index with split [0, 27452875) and [27452875, 27952019) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.987607 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.016 seconds 
[default0]: total number of samples: 27952020 [default0]: total number of epochs: 56 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.028527 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 27961608) total of 27961608 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (286305) is smaller than 95.0% of number of samples per epoch (348542), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 68.374838 [default0]: using: [default0]: number of documents: 27961608 [default0]: number of epochs: 42 [default0]: sequence length: 2048 [default0]: total number of samples: 14638799 [default0]: > elasped time to build and save sample-idx mapping (seconds): 10.501170 [default0]: > building shuffle index with split [0, 14290257) and [14290257, 14638799) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.391336 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.019 seconds [default0]: total number of samples: 14638800 [default0]: total number of epochs: 42 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.006352 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 36350552) total of 36350552 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (24801) is smaller than 95.0% of number of samples per epoch (593669), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 101.234838 [default0]: using: [default0]: number of documents: 36350552 [default0]: number of epochs: 46 [default0]: sequence length: 2048 [default0]: total number of samples: 27308814 [default0]: > elasped time to build and save sample-idx mapping (seconds): 15.445697 [default0]: > building shuffle index with split [0, 26715144) and [26715144, 27308814) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.968035 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.022 seconds [default0]: total number of samples: 27308815 [default0]: total number of epochs: 46 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... 
[default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.003736 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 692454) total of 692454 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (294445) is smaller than 95.0% of number of samples per epoch (313064), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.496277 [default0]: using: [default0]: number of documents: 692454 [default0]: number of epochs: 22 [default0]: sequence length: 2048 [default0]: total number of samples: 6887420 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.105572 [default0]: > building shuffle index with split [0, 6574355) and [6574355, 6887420) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.151943 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_shuffle_idx.npy [default0]: 
loaded indexed file in 0.013 seconds [default0]: total number of samples: 6887421 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.017804 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 23027980) total of 23027980 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (159718) is smaller than 95.0% of number of samples per epoch (412173), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 33.578622 [default0]: using: [default0]: number of documents: 23027980 [default0]: number of epochs: 25 [default0]: sequence length: 2048 [default0]: total number of samples: 10304342 [default0]: > elasped time to build and save sample-idx mapping (seconds): 5.224553 [default0]: > building shuffle index with split [0, 9892169) and [9892169, 10304342) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.236011 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.016 seconds [default0]: total number of samples: 10304343 [default0]: total number of epochs: 25 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.009976 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 9098495) total of 9098495 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (2061556) is smaller than 95.0% of number of samples per epoch (2892475), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 4.566206 [default0]: using: [default0]: number of documents: 9098495 [default0]: number of epochs: 10 [default0]: sequence length: 2048 [default0]: total number of samples: 28924754 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.934278 [default0]: > building shuffle index with split [0, 26032279) and [26032279, 28924754) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.985260 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.017 seconds [default0]: total number of samples: 28924755 [default0]: total number of epochs: 10 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.021048 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 4114797) total of 4114797 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (362105) is smaller than 95.0% of number of samples per epoch (2720896), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 2.043312 [default0]: using: [default0]: number of documents: 4114797 [default0]: number of epochs: 11 [default0]: sequence length: 2048 [default0]: total number of samples: 29929865 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.432344 [default0]: > building shuffle index with split [0, 27208968) and [27208968, 29929865) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 1.032287 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.015 seconds 
[default0]: total number of samples: 29929866 [default0]: total number of epochs: 11 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.006202 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 142095) total of 142095 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (1829) is smaller than 95.0% of number of samples per epoch (7103), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.060044 [default0]: using: [default0]: number of documents: 142095 [default0]: number of epochs: 18 [default0]: sequence length: 2048 [default0]: total number of samples: 127854 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.017316 [default0]: > building shuffle index with split [0, 120751) and [120751, 127854) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.004353 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.009 seconds [default0]: total number of samples: 127855 [default0]: total number of epochs: 18 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870676 [default0]: dataset 1, input: 0.0207314, achieved: 0.0207314 [default0]: dataset 2, input: 0.1247, achieved: 0.1247 [default0]: dataset 3, input: 0.124182, achieved: 0.124182 [default0]: dataset 4, input: 0.0029046, achieved: 0.0029046 [default0]: dataset 5, input: 0.1247, achieved: 0.1247 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659275 [default0]: dataset 7, input: 0.120941, achieved: 0.120941 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310665 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454631 [default0]: dataset 10, input: 0.127064, achieved: 0.127064 [default0]: dataset 11, input: 0.1247, achieved: 0.1247 [default0]: dataset 12, input: 0.000554406, achieved: 0.000554405 [default0]:> elapsed time for building blendable dataset indices: 4.04 (sec) [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.002366 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [1211127, 1274938) total of 63811 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (3428) is smaller than 95.0% of number of samples per epoch (13396), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.026619 [default0]: using: [default0]: number of documents: 63811 [default0]: number of epochs: 18 [default0]: sequence length: 2048 [default0]: total number of samples: 241145 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.005174 [default0]: > building shuffle index with split [0, 227748) and [227748, 241145) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.006570 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 241146 [default0]: total number of epochs: 18 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002403 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [2104966, 2215871) total of 110905 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (10348) is smaller than 95.0% of number of samples per epoch (11174), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.015049 [default0]: using: [default0]: number of documents: 110905 [default0]: number of epochs: 5 [default0]: sequence length: 2048 [default0]: total number of samples: 55871 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.003057 [default0]: > building shuffle index with split [0, 44697) and [44697, 55871) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002864 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of 
samples: 55872 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.009845 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [13965889, 14701711) total of 735822 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.018397 [default0]: using: [default0]: number of documents: 735822 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 1880534 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.017694 [default0]: > building shuffle index with split [0, 1880534) and [1880534, 1880534) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.034689 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.009 seconds [default0]: total number of samples: 1880535 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.003629 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [2626391, 2764767) total of 138376 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (89572) is smaller than 95.0% of number of samples per epoch (240148), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.008387 [default0]: using: [default0]: number of documents: 138376 [default0]: number of epochs: 2 [default0]: sequence length: 2048 [default0]: total number of samples: 480296 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.005186 [default0]: > building shuffle index with split [0, 240148) and [240148, 480296) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.009917 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 480297 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002055 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [746147, 785459) total of 39312 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (288) is smaller than 95.0% of number of samples per epoch (1060), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.009303 [default0]: using: [default0]: number of documents: 39312 [default0]: number of epochs: 8 [default0]: sequence length: 2048 [default0]: total number of samples: 8486 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001861 [default0]: > building shuffle index with split [0, 7425) and [7425, 8486) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.001641 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 8487 
[default0]: total number of epochs: 8 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.002356 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [1659380, 1746807) total of 87427 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.004679 [default0]: using: [default0]: number of documents: 87427 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 907156 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.004551 [default0]: > building shuffle index with split [0, 907156) and [907156, 907156) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.017715 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 907157 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.009854 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [27961608, 29434823) total of 1473215 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (3929) is smaller than 95.0% of number of samples per epoch (15556), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.595725 [default0]: using: [default0]: number of documents: 1473215 [default0]: number of epochs: 12 [default0]: sequence length: 2048 [default0]: total number of samples: 186674 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.030645 [default0]: > building shuffle index with split [0, 171117) and [171117, 186674) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.005210 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.007 seconds [default0]: total number of samples: 186675 [default0]: total number of epochs: 12 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.009861 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [36350552, 38265755) total of 1915203 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (13053) is smaller than 95.0% of number of samples per epoch (25671), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.961445 [default0]: using: [default0]: number of documents: 1915203 [default0]: number of epochs: 13 [default0]: sequence length: 2048 [default0]: total number of samples: 333732 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.045044 [default0]: > building shuffle index with split [0, 308060) and [308060, 333732) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.008467 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.021 seconds 
[default0]: total number of samples: 333733 [default0]: total number of epochs: 13 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.001923 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [692454, 728937) total of 36483 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (3876) is smaller than 95.0% of number of samples per epoch (19652), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.007104 [default0]: using: [default0]: number of documents: 36483 [default0]: number of epochs: 5 [default0]: sequence length: 2048 [default0]: total number of samples: 98263 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.003654 [default0]: > building shuffle index with split [0, 78610) and [78610, 98263) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.003778 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 98264 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.010075 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [23027980, 24241256) total of 1213276 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (13145) is smaller than 95.0% of number of samples per epoch (21513), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.200171 [default0]: using: [default0]: number of documents: 1213276 [default0]: number of epochs: 6 [default0]: sequence length: 2048 [default0]: total number of samples: 129079 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.015500 [default0]: > building shuffle index with split [0, 107566) and [107566, 129079) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.003973 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.007 seconds [default0]: total number of samples: 129080 [default0]: total number of epochs: 6 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002745 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [9098495, 9577868) total of 479373 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (24678) is smaller than 95.0% of number of samples per epoch (156347), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.032711 [default0]: using: [default0]: number of documents: 479373 [default0]: number of epochs: 3 [default0]: sequence length: 2048 [default0]: total number of samples: 469041 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.009577 [default0]: > building shuffle index with split [0, 312694) and [312694, 469041) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.010022 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total 
number of samples: 469042 [default0]: total number of epochs: 3 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.002281 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [4114797, 4331593) total of 216796 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (131990) is smaller than 95.0% of number of samples per epoch (199104), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.010530 [default0]: using: [default0]: number of documents: 216796 [default0]: number of epochs: 2 [default0]: sequence length: 2048 [default0]: total number of samples: 398208 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.006054 [default0]: > building shuffle index with split [0, 199104) and [199104, 398208) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.008991 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 398209 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.000586 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [142095, 149581) total of 7486 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (188) is smaller than 95.0% of number of samples per epoch (257), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.003164 [default0]: using: [default0]: number of documents: 7486 [default0]: number of epochs: 6 [default0]: sequence length: 2048 [default0]: total number of samples: 1543 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001714 [default0]: > building shuffle index with split [0, 1285) and [1285, 1543) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.001593 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 1544 [default0]: total number of epochs: 6 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870675 [default0]: dataset 1, input: 0.0207314, achieved: 0.0207315 [default0]: dataset 2, input: 0.1247, achieved: 0.1247 [default0]: dataset 3, input: 0.124182, achieved: 0.124182 [default0]: dataset 4, input: 0.0029046, achieved: 0.00290461 [default0]: dataset 5, input: 0.1247, achieved: 0.1247 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659274 [default0]: dataset 7, input: 0.120941, achieved: 0.120941 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310665 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454631 [default0]: dataset 10, input: 0.127064, achieved: 0.127064 [default0]: dataset 11, input: 0.1247, achieved: 0.1247 [default0]: dataset 12, input: 0.000554406, achieved: 0.000554525 [default0]:> elapsed time for building blendable dataset indices: 0.09 (sec) [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.003739 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: test: [default0]: document indices in [1274938, 1276214) total of 1276 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001685 [default0]: using: [default0]: number of documents: 1276 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 202914 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002362 [default0]: > building shuffle index with split [0, 202914) and [202914, 202914) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.005445 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 202915 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.002196 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: test: [default0]: document indices in [2215871, 2218089) total of 2218 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (4) is smaller than 95.0% of number of samples per epoch (35), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002126 [default0]: using: [default0]: number of documents: 2218 [default0]: number of epochs: 13 [default0]: sequence length: 2048 [default0]: total number of samples: 458 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.000543 [default0]: > building shuffle index with split [0, 423) and [423, 458) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000888 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 459 [default0]: total number of epochs: 13 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001928 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: test: [default0]: document indices in [14701711, 14716427) total of 14716 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001813 [default0]: using: [default0]: number of documents: 14716 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 37486 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001882 [default0]: > building shuffle index with split [0, 37486) and [37486, 37486) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002234 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 37487 [default0]: total number of epochs: 1 [default0]: > building 
dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.001981 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: test: [default0]: document indices in [2764767, 2767535) total of 2768 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002009 [default0]: using: [default0]: number of documents: 2768 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 9925 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001703 [default0]: > building shuffle index with split [0, 9925) and [9925, 9925) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002443 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 9926 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.001934 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: test: [default0]: document indices in [785459, 786245) total of 786 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (2) is smaller than 95.0% of number of samples per epoch (19), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002496 [default0]: using: [default0]: number of documents: 786 [default0]: number of epochs: 4 [default0]: sequence length: 2048 [default0]: total number of samples: 78 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.000519 [default0]: > building shuffle index with split [0, 58) and [58, 78) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000472 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 79 [default0]: total number of epochs: 4 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001941 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: test: [default0]: document indices in [1746807, 1748556) total of 1749 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002136 [default0]: using: [default0]: number of documents: 1749 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 34095 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002118 [default0]: > building shuffle index with split [0, 34095) and [34095, 34095) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002671 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 34096 [default0]: total number of epochs: 1 [default0]: > building dataset 
index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.002022 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: test: [default0]: document indices in [29434823, 29464287) total of 29464 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (42) is smaller than 95.0% of number of samples per epoch (328), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.005099 [default0]: using: [default0]: number of documents: 29464 [default0]: number of epochs: 5 [default0]: sequence length: 2048 [default0]: total number of samples: 1644 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002219 [default0]: > building shuffle index with split [0, 1315) and [1315, 1644) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002041 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 1645 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.001980 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: test: [default0]: document indices in [38265755, 38304059) total of 38304 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (268) is smaller than 95.0% of number of samples per epoch (555), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.006891 [default0]: using: [default0]: number of documents: 38304 [default0]: number of epochs: 5 [default0]: sequence length: 2048 [default0]: total number of samples: 2777 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001766 [default0]: > building shuffle index with split [0, 2222) and [2222, 2777) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.001649 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 2778 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001740 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: test: [default0]: document indices in [728937, 729667) total of 730 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (283) is smaller than 95.0% of number of samples per epoch (357), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001725 [default0]: using: [default0]: number of documents: 730 [default0]: number of epochs: 2 [default0]: sequence length: 2048 [default0]: total number of samples: 715 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001854 [default0]: > building shuffle index with split [0, 357) and [357, 715) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000560 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 716 [default0]: total 
number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.001954 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: test: [default0]: document indices in [24241256, 24265522) total of 24266 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (62) is smaller than 95.0% of number of samples per epoch (437), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.003073 [default0]: using: [default0]: number of documents: 24266 [default0]: number of epochs: 3 [default0]: sequence length: 2048 [default0]: total number of samples: 1311 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001647 [default0]: > building shuffle index with split [0, 874) and [874, 1311) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002045 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 1312 [default0]: total number of epochs: 3 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.002023 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: test: [default0]: document indices in [9577868, 9587455) total of 9587 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... 
[default0]: > last epoch number of samples (955) is smaller than 95.0% of number of samples per epoch (1661), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.001864 [default0]: using: [default0]: number of documents: 9587 [default0]: number of epochs: 2 [default0]: sequence length: 2048 [default0]: total number of samples: 3323 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.002170 [default0]: > building shuffle index with split [0, 1661) and [1661, 3323) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002701 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 3324 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002703 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: test: [default0]: document indices in [4331593, 4335929) total of 4336 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > only one epoch required, setting separate_last_epoch to False [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.002090 [default0]: using: [default0]: number of documents: 4336 [default0]: number of epochs: 1 [default0]: sequence length: 2048 [default0]: total number of samples: 3963 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.001478 [default0]: > building shuffle index with split [0, 3963) and [3963, 3963) ... [default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.002203 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 3964 [default0]: total number of epochs: 1 [default0]: > building 
dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.000794 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: test: [default0]: document indices in [149581, 149731) total of 150 documents [default0]: > WARNING: could not find index map files, building the indices on rank 0 ... [default0]: > last epoch number of samples (5) is smaller than 95.0% of number of samples per epoch (7), setting separate_last_epoch to True [default0]: > elasped time to build and save doc-idx mapping (seconds): 0.000444 [default0]: using: [default0]: number of documents: 150 [default0]: number of epochs: 2 [default0]: sequence length: 2048 [default0]: total number of samples: 14 [default0]: > elasped time to build and save sample-idx mapping (seconds): 0.000561 [default0]: > building shuffle index with split [0, 7) and [7, 14) ... 
[default0]: > elasped time to build and save shuffle-idx mapping (seconds): 0.000543 [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 15 [default0]: total number of epochs: 2 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870664 [default0]: dataset 1, input: 0.0207314, achieved: 0.020733 [default0]: dataset 2, input: 0.1247, achieved: 0.124699 [default0]: dataset 3, input: 0.124182, achieved: 0.12418 [default0]: dataset 4, input: 0.0029046, achieved: 0.0029059 [default0]: dataset 5, input: 0.1247, achieved: 0.124699 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659284 [default0]: dataset 7, input: 0.120941, achieved: 0.12094 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310676 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454632 [default0]: dataset 10, input: 0.127064, achieved: 0.127063 [default0]: dataset 11, input: 0.1247, achieved: 0.124699 [default0]: dataset 12, input: 0.000554406, achieved: 0.000555736 [default0]:> elapsed time for building blendable dataset indices: 0.01 (sec) [default0]:> finished creating GPT datasets ... [default1]:[001-004] 177.6021B / 177.6021B [default0]:[000-004] 177.6021B / 177.6021B [default2]:[002-004] 177.6021B / 177.6021B [default0]:[000-009] 177.6021B / 177.6021B [default1]:[001-007] 177.6021B / 177.6021B [default1]:[001-009] 177.6021B / 177.6021B [default1]:[001-003] 177.6021B / 177.6021B [default3]:[003-004] 177.6021B / 177.6021B [default0]:[000-007] 177.6021B / 177.6021B [default3]:[003-009] 177.6021B / 177.6021B [default0]:[000-001] 177.6021B / 177.6021B [default0]:[000-003] 177.6021B / 177.6021B [default2]:[002-006] 177.6021B / 177.6021B [default1]:[001-005] 177.6021B / 177.6021B [default3]:[003-005] 177.6021B / 177.6021B [default2]:[002-005] 177.6021B / 177.6021B [default2]:[002-003] 177.6021B / 177.6021B [default1]:[001-002] 177.6021B / 177.6021B [default3]:[003-010] 177.6021B / 177.6021B [default3]:[003-007] 177.6021B / 177.6021B [default2]:[002-001] 177.6021B / 177.6021B [default3]:[003-001] 177.6021B / 177.6021B [default1]:[001-006] 177.6021B / 177.6021B [default0]:[000-006] 177.6021B / 177.6021B [default2]:[002-010] 177.6021B / 
177.6021B [default0]:[000-002] 177.6021B / 177.6021B [default2]:[002-009] 177.6021B / 177.6021B [default7]:time (ms) | model-and-optimizer-setup: 7875.21 | train/valid/test-data-iterators-setup: 280711.35 [default0]:[000-010] 177.6021B / 177.6021B [default0]:[000-011] 191.1639B / 148.0045B [default0]:[000-005] 177.6021B / 177.6021B [default3]:[003-006] 177.6021B / 177.6021B [default3]:[003-002] 177.6021B / 177.6021B [default3]:[003-003] 177.6021B / 177.6021B [default2]:[002-011] 191.1639B / 148.0045B [default1]:[001-010] 177.6021B / 177.6021B [default1]:[001-011] 191.1639B / 148.0045B [default2]:[002-007] 177.6021B / 177.6021B [default1]:[001-001] 177.6021B / 177.6021B [default2]:[002-002] 177.6021B / 177.6021B [default3]:[003-011] 191.1639B / 148.0045B [default2]:[002-000] 191.1625B / 148.0031B [default1]:[001-000] 191.1625B / 148.0031B [default3]:[003-000] 191.1625B / 148.0031B [default0]:[after dataloaders are built] datetime: 2022-03-03 05:49:58 [default0]:done with setup ... [default0]:training ... 
[default0]:Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings: [default0]:[000-000] 191.1625B / 148.0031B [default0]:[before the start of training step] datetime: 2022-03-03 05:49:58 [default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information [default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False [default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 70 total layers [default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:554:forward] ----Synchronization False [default0]:[2022-03-03 05:49:58,855] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False [default1]:[001-008] 177.6021B / 177.6021B [default2]:[002-008] 177.6021B / 177.6021B [default0]:[000-008] 177.6021B / 177.6021B [default3]:[003-008] 177.6021B / 177.6021B [default3]:[Rank 323] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 35] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 227] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 163] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 195] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default7]: iteration 1/ 128728 | consumed samples: 16 | consumed tokens: 32768 | elapsed time per iteration (s): 40.31 | learning rate: 5.243E-09 | global batch size: 16 | lm loss: 6.158806E+01 | grad norm: 17.319 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.397 | TFLOPs: 3.04 | [default3]:[Rank 67] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 99] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 355] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0 [default3]:[Rank 3] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0 [default3]:[Rank 259] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 131] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default3]:[Rank 291] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 97] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 65] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 161] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 193] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 321] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 
37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 353] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0 [default1]:[Rank 33] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 1] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0 [default1]:[Rank 257] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 129] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 225] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default1]:[Rank 289] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 256] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 128] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 288] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 224] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 32] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 
[default0]:[Rank 96] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 192] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 64] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 320] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default0]:[Rank 352] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0 [default0]:[Rank 160] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 2] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0 [default0]:[Rank 0] (after 1 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 47960.0 | max reserved: 47960.0 [default2]:[Rank 258] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 130] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 194] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 34] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 98] (after 1 iterations) memory (MB) | 
allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 322] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 290] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 162] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 354] (after 1 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48100.0 | max reserved: 48100.0 [default2]:[Rank 226] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default2]:[Rank 66] (after 1 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41920.0 | max reserved: 41920.0 [default7]: iteration 2/ 128728 | consumed samples: 32 | consumed tokens: 65536 | elapsed time per iteration (s): 14.54 | learning rate: 1.049E-08 | global batch size: 16 | lm loss: 6.161202E+01 | grad norm: 17.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.101 | TFLOPs: 8.43 | [default7]: iteration 3/ 128728 | consumed samples: 48 | consumed tokens: 98304 | elapsed time per iteration (s): 14.86 | learning rate: 1.573E-08 | global batch size: 16 | lm loss: 6.159873E+01 | grad norm: 17.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.076 | TFLOPs: 8.24 | [default7]: iteration 4/ 128728 | consumed samples: 64 | consumed tokens: 131072 | elapsed time per iteration (s): 14.80 | learning rate: 2.097E-08 | global batch size: 16 | lm loss: 6.156909E+01 | grad norm: 17.233 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.081 | TFLOPs: 8.28 | [default7]: iteration 5/ 128728 | consumed samples: 80 | consumed tokens: 163840 | elapsed time per iteration (s): 14.83 | learning rate: 2.621E-08 | global batch size: 16 | lm loss: 6.158672E+01 | grad norm: 17.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.079 | TFLOPs: 8.26 | [default7]: iteration 6/ 128728 | consumed samples: 96 | consumed tokens: 196608 | elapsed time per iteration (s): 14.79 | learning rate: 3.146E-08 | global batch size: 16 | lm loss: 6.160669E+01 | grad norm: 17.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.28 | [default7]: iteration 7/ 128728 | consumed samples: 112 | consumed tokens: 229376 | elapsed time per iteration (s): 14.87 | learning rate: 3.670E-08 | global batch size: 16 | lm loss: 6.159612E+01 | grad norm: 17.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.076 | TFLOPs: 8.24 | [default7]: iteration 8/ 128728 | consumed samples: 128 | consumed tokens: 262144 | elapsed time per iteration (s): 14.78 | learning rate: 4.194E-08 | global batch size: 16 | lm loss: 6.157154E+01 | grad norm: 17.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.29 | [default7]: iteration 9/ 128728 | consumed samples: 144 | consumed tokens: 294912 | elapsed time per iteration (s): 14.90 | learning rate: 4.719E-08 | global batch size: 16 | lm loss: 6.151357E+01 | grad norm: 18.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.074 | TFLOPs: 8.22 | [default7]: iteration 10/ 128728 | consumed samples: 160 | consumed tokens: 327680 | elapsed time per iteration (s): 14.84 | learning rate: 5.243E-08 | global batch size: 16 | lm loss: 
6.143620E+01 | grad norm: 19.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.078 | TFLOPs: 8.26 | [default7]: iteration 11/ 128728 | consumed samples: 176 | consumed tokens: 360448 | elapsed time per iteration (s): 14.92 | learning rate: 5.767E-08 | global batch size: 16 | lm loss: 6.150426E+01 | grad norm: 20.036 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.072 | TFLOPs: 8.21 | [default7]: iteration 12/ 128728 | consumed samples: 192 | consumed tokens: 393216 | elapsed time per iteration (s): 14.78 | learning rate: 6.291E-08 | global batch size: 16 | lm loss: 6.130256E+01 | grad norm: 22.535 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.29 | [default7]: iteration 13/ 128728 | consumed samples: 208 | consumed tokens: 425984 | elapsed time per iteration (s): 14.86 | learning rate: 6.816E-08 | global batch size: 16 | lm loss: 6.122111E+01 | grad norm: 23.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.077 | TFLOPs: 8.24 | [default7]: iteration 14/ 128728 | consumed samples: 224 | consumed tokens: 458752 | elapsed time per iteration (s): 14.80 | learning rate: 7.340E-08 | global batch size: 16 | lm loss: 6.115615E+01 | grad norm: 25.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.081 | TFLOPs: 8.28 | [default7]: iteration 15/ 128728 | consumed samples: 240 | consumed tokens: 491520 | elapsed time per iteration (s): 14.79 | learning rate: 7.864E-08 | global batch size: 16 | lm loss: 6.112857E+01 | grad norm: 24.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.28 | [default7]: iteration 16/ 128728 | consumed samples: 256 | consumed tokens: 524288 | elapsed time per iteration (s): 
14.80 | learning rate: 8.389E-08 | global batch size: 16 | lm loss: 5.982215E+01 | grad norm: 40.392 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.081 | TFLOPs: 8.28 | [default7]: iteration 17/ 128728 | consumed samples: 272 | consumed tokens: 557056 | elapsed time per iteration (s): 14.90 | learning rate: 8.913E-08 | global batch size: 16 | lm loss: 5.965714E+01 | grad norm: 43.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.074 | TFLOPs: 8.22 | [default7]: iteration 18/ 128728 | consumed samples: 288 | consumed tokens: 589824 | elapsed time per iteration (s): 14.98 | learning rate: 9.437E-08 | global batch size: 16 | lm loss: 5.951318E+01 | grad norm: 44.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.068 | TFLOPs: 8.18 | [default7]: iteration 19/ 128728 | consumed samples: 304 | consumed tokens: 622592 | elapsed time per iteration (s): 14.78 | learning rate: 9.961E-08 | global batch size: 16 | lm loss: 5.903408E+01 | grad norm: 48.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.29 | [default7]: iteration 20/ 128728 | consumed samples: 320 | consumed tokens: 655360 | elapsed time per iteration (s): 14.87 | learning rate: 1.049E-07 | global batch size: 16 | lm loss: 5.875332E+01 | grad norm: 50.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.076 | TFLOPs: 8.24 | [default7]: iteration 21/ 128728 | consumed samples: 336 | consumed tokens: 688128 | elapsed time per iteration (s): 14.86 | learning rate: 1.101E-07 | global batch size: 16 | lm loss: 5.413025E+01 | grad norm: 85.585 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.076 | TFLOPs: 8.24 | [default7]: iteration 22/ 128728 | consumed samples: 
352 | consumed tokens: 720896 | elapsed time per iteration (s): 14.92 | learning rate: 1.153E-07 | global batch size: 16 | lm loss: 5.085058E+01 | grad norm: 93.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.072 | TFLOPs: 8.21 | [default7]: iteration 23/ 128728 | consumed samples: 368 | consumed tokens: 753664 | elapsed time per iteration (s): 14.81 | learning rate: 1.206E-07 | global batch size: 16 | lm loss: 4.981078E+01 | grad norm: 96.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.080 | TFLOPs: 8.27 | [default7]: iteration 24/ 128728 | consumed samples: 384 | consumed tokens: 786432 | elapsed time per iteration (s): 14.82 | learning rate: 1.258E-07 | global batch size: 16 | lm loss: 4.871767E+01 | grad norm: 99.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.080 | TFLOPs: 8.27 | [default7]: iteration 25/ 128728 | consumed samples: 400 | consumed tokens: 819200 | elapsed time per iteration (s): 14.83 | learning rate: 1.311E-07 | global batch size: 16 | lm loss: 4.742308E+01 | grad norm: 101.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.079 | TFLOPs: 8.26 | [default7]: iteration 26/ 128728 | consumed samples: 416 | consumed tokens: 851968 | elapsed time per iteration (s): 14.79 | learning rate: 1.363E-07 | global batch size: 16 | lm loss: 4.459019E+01 | grad norm: 103.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.28 | [default7]: iteration 27/ 128728 | consumed samples: 432 | consumed tokens: 884736 | elapsed time per iteration (s): 14.85 | learning rate: 1.416E-07 | global batch size: 16 | lm loss: 4.345989E+01 | grad norm: 103.374 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.077 | 
TFLOPs: 8.25 | [default7]: iteration 28/ 128728 | consumed samples: 448 | consumed tokens: 917504 | elapsed time per iteration (s): 14.86 | learning rate: 1.468E-07 | global batch size: 16 | lm loss: 4.248281E+01 | grad norm: 102.881 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.077 | TFLOPs: 8.25 | [default7]: iteration 29/ 128728 | consumed samples: 464 | consumed tokens: 950272 | elapsed time per iteration (s): 14.87 | learning rate: 1.520E-07 | global batch size: 16 | lm loss: 3.440926E+01 | grad norm: 90.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.076 | TFLOPs: 8.24 | [default7]: iteration 30/ 128728 | consumed samples: 480 | consumed tokens: 983040 | elapsed time per iteration (s): 14.79 | learning rate: 1.573E-07 | global batch size: 16 | lm loss: 3.089366E+01 | grad norm: 79.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.081 | TFLOPs: 8.28 | [default7]: iteration 31/ 128728 | consumed samples: 496 | consumed tokens: 1015808 | elapsed time per iteration (s): 14.79 | learning rate: 1.625E-07 | global batch size: 16 | lm loss: 2.933587E+01 | grad norm: 73.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.28 | [default7]: iteration 32/ 128728 | consumed samples: 512 | consumed tokens: 1048576 | elapsed time per iteration (s): 14.83 | learning rate: 1.678E-07 | global batch size: 16 | lm loss: 2.763102E+01 | grad norm: 68.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.079 | TFLOPs: 8.26 | [default7]: iteration 33/ 128728 | consumed samples: 528 | consumed tokens: 1081344 | elapsed time per iteration (s): 14.77 | learning rate: 1.730E-07 | global batch size: 16 | lm loss: 2.619627E+01 | grad norm: 63.092 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.084 | TFLOPs: 8.30 | [default7]: iteration 34/ 128728 | consumed samples: 544 | consumed tokens: 1114112 | elapsed time per iteration (s): 14.77 | learning rate: 1.783E-07 | global batch size: 16 | lm loss: 2.509729E+01 | grad norm: 59.336 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.083 | TFLOPs: 8.29 | [default7]: iteration 35/ 128728 | consumed samples: 560 | consumed tokens: 1146880 | elapsed time per iteration (s): 14.86 | learning rate: 1.835E-07 | global batch size: 16 | lm loss: 2.208402E+01 | grad norm: 48.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.077 | TFLOPs: 8.25 | [default7]: iteration 36/ 128728 | consumed samples: 576 | consumed tokens: 1179648 | elapsed time per iteration (s): 14.79 | learning rate: 1.887E-07 | global batch size: 16 | lm loss: 2.048165E+01 | grad norm: 43.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.28 | [default7]: iteration 37/ 128728 | consumed samples: 592 | consumed tokens: 1212416 | elapsed time per iteration (s): 14.90 | learning rate: 1.940E-07 | global batch size: 16 | lm loss: 1.919763E+01 | grad norm: 38.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.074 | TFLOPs: 8.22 | [default7]: iteration 38/ 128728 | consumed samples: 608 | consumed tokens: 1245184 | elapsed time per iteration (s): 14.76 | learning rate: 1.992E-07 | global batch size: 16 | lm loss: 1.835708E+01 | grad norm: 35.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.084 | TFLOPs: 8.30 | [default7]: iteration 39/ 128728 | consumed samples: 624 | consumed tokens: 1277952 | elapsed time per iteration (s): 14.85 | learning rate: 2.045E-07 | global batch size: 16 | lm loss: 
1.753267E+01 | grad norm: 33.059 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.077 | TFLOPs: 8.25 | [default7]: iteration 40/ 128728 | consumed samples: 640 | consumed tokens: 1310720 | elapsed time per iteration (s): 14.90 | learning rate: 2.097E-07 | global batch size: 16 | lm loss: 1.669237E+01 | grad norm: 30.094 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.074 | TFLOPs: 8.22 | [default7]: iteration 41/ 128728 | consumed samples: 656 | consumed tokens: 1343488 | elapsed time per iteration (s): 14.76 | learning rate: 2.150E-07 | global batch size: 16 | lm loss: 1.602054E+01 | grad norm: 27.577 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.084 | TFLOPs: 8.30 | [default7]: iteration 42/ 128728 | consumed samples: 672 | consumed tokens: 1376256 | elapsed time per iteration (s): 14.78 | learning rate: 2.202E-07 | global batch size: 16 | lm loss: 1.524471E+01 | grad norm: 24.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.083 | TFLOPs: 8.29 | [default7]: iteration 43/ 128728 | consumed samples: 688 | consumed tokens: 1409024 | elapsed time per iteration (s): 14.73 | learning rate: 2.254E-07 | global batch size: 16 | lm loss: 1.467593E+01 | grad norm: 21.341 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.087 | TFLOPs: 8.32 | [default7]: iteration 44/ 128728 | consumed samples: 704 | consumed tokens: 1441792 | elapsed time per iteration (s): 14.84 | learning rate: 2.307E-07 | global batch size: 16 | lm loss: 1.369703E+01 | grad norm: 15.454 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.078 | TFLOPs: 8.26 | [default7]: iteration 45/ 128728 | consumed samples: 720 | consumed tokens: 1474560 | elapsed time per iteration 
(s): 14.87 | learning rate: 2.359E-07 | global batch size: 16 | lm loss: 1.321554E+01 | grad norm: 12.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.076 | TFLOPs: 8.24 | [default7]: iteration 46/ 128728 | consumed samples: 736 | consumed tokens: 1507328 | elapsed time per iteration (s): 14.77 | learning rate: 2.412E-07 | global batch size: 16 | lm loss: 1.281323E+01 | grad norm: 11.024 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.084 | TFLOPs: 8.30 | [default7]: iteration 47/ 128728 | consumed samples: 752 | consumed tokens: 1540096 | elapsed time per iteration (s): 14.91 | learning rate: 2.464E-07 | global batch size: 16 | lm loss: 1.263766E+01 | grad norm: 8.627 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.073 | TFLOPs: 8.22 | [default7]: iteration 48/ 128728 | consumed samples: 768 | consumed tokens: 1572864 | elapsed time per iteration (s): 14.98 | learning rate: 2.517E-07 | global batch size: 16 | lm loss: 1.236759E+01 | grad norm: 4.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.068 | TFLOPs: 8.18 | [default7]: iteration 49/ 128728 | consumed samples: 784 | consumed tokens: 1605632 | elapsed time per iteration (s): 14.91 | learning rate: 2.569E-07 | global batch size: 16 | lm loss: 1.218161E+01 | grad norm: 3.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.073 | TFLOPs: 8.22 | [default7]: iteration 50/ 128728 | consumed samples: 800 | consumed tokens: 1638400 | elapsed time per iteration (s): 14.78 | learning rate: 2.621E-07 | global batch size: 16 | lm loss: 1.218425E+01 | grad norm: 2.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.083 | TFLOPs: 8.29 | [default0]:saving checkpoint at iteration 50 
to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 06:03:02,742] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/mp_rank_00_model_states.pt [default1]:[2022-03-03 06:03:02,974] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/mp_rank_01_model_states.pt [default7]:[2022-03-03 06:03:08,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default1]:[2022-03-03 06:03:08,268] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default2]:[2022-03-03 06:03:08,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default0]:[2022-03-03 06:03:08,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default2]:[2022-03-03 06:03:08,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default5]:[2022-03-03 06:03:08,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default1]:[2022-03-03 06:03:08,802] [INFO] [engine.py:3077:_save_zero_checkpoint] 
bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default4]:[2022-03-03 06:03:08,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default6]:[2022-03-03 06:03:08,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default5]:[2022-03-03 06:03:08,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default6]:[2022-03-03 06:03:09,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default7]:[2022-03-03 06:03:09,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default0]:[2022-03-03 06:03:09,190] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default1]:[2022-03-03 06:03:09,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default5]:[2022-03-03 06:03:09,154] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default6]:[2022-03-03 06:03:09,177] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default3]:[2022-03-03 06:03:09,309] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default2]:[2022-03-03 06:03:09,343] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default3]:[2022-03-03 06:03:09,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default4]:[2022-03-03 06:03:09,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default0]:[2022-03-03 06:03:09,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default3]:[2022-03-03 06:03:09,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default7]:[2022-03-03 06:03:09,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default4]:[2022-03-03 06:03:09,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default2]:[2022-03-03 06:03:11,155] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default5]:[2022-03-03 06:03:11,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default7]:[2022-03-03 06:03:11,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default2]:[2022-03-03 06:03:11,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default6]:[2022-03-03 06:03:11,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default5]:[2022-03-03 06:03:11,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default3]:[2022-03-03 06:03:11,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default5]:[2022-03-03 06:03:11,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default7]:[2022-03-03 06:03:11,637] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default4]:[2022-03-03 06:03:11,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default7]:[2022-03-03 06:03:11,672] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default5]:[2022-03-03 06:03:11,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default5]:[2022-03-03 06:03:11,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default3]:[2022-03-03 06:03:11,898] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default7]:[2022-03-03 06:03:11,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default2]:[2022-03-03 06:03:11,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default1]:[2022-03-03 06:03:11,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default6]:[2022-03-03 06:03:11,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default0]:[2022-03-03 06:03:12,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default4]:[2022-03-03 06:03:11,950] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default4]:[2022-03-03 06:03:12,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default1]:[2022-03-03 06:03:12,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default1]:[2022-03-03 06:03:12,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default6]:[2022-03-03 06:03:12,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default0]:[2022-03-03 06:03:12,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default3]:[2022-03-03 06:03:12,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default1]:[2022-03-03 06:03:12,225] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default0]:[2022-03-03 06:03:12,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default2]:[2022-03-03 06:03:12,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default4]:[2022-03-03 06:03:12,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default2]:[2022-03-03 06:03:12,376] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default3]:[2022-03-03 06:03:12,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default4]:[2022-03-03 06:03:12,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default1]:[2022-03-03 06:03:12,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default0]:[2022-03-03 06:03:12,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default4]:[2022-03-03 06:03:12,424] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default6]:[2022-03-03 06:03:12,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default5]:[2022-03-03 06:03:12,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default1]:[2022-03-03 06:03:12,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default3]:[2022-03-03 06:03:12,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default0]:[2022-03-03 06:03:12,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default2]:[2022-03-03 06:03:12,594] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default2]:[2022-03-03 06:03:12,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default7]:[2022-03-03 06:03:12,624] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default6]:[2022-03-03 06:03:12,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default6]:[2022-03-03 06:03:12,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default3]:[2022-03-03 06:03:12,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default6]:[2022-03-03 06:03:12,698] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default2]:[2022-03-03 06:03:12,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default5]:[2022-03-03 06:03:12,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default6]:[2022-03-03 06:03:12,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default5]:[2022-03-03 06:03:13,012] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default1]:[2022-03-03 06:03:12,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default4]:[2022-03-03 06:03:13,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default2]:[2022-03-03 06:03:13,099] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default3]:[2022-03-03 06:03:13,101] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default7]:[2022-03-03 06:03:13,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default3]:[2022-03-03 06:03:13,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default7]:[2022-03-03 06:03:13,130] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default7]:[2022-03-03 06:03:13,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default3]:[2022-03-03 06:03:13,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default2]:[2022-03-03 06:03:13,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default6]:[2022-03-03 06:03:13,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default7]:[2022-03-03 06:03:13,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default6]:[2022-03-03 06:03:13,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default6]:[2022-03-03 06:03:13,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default4]:[2022-03-03 06:03:13,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default1]:[2022-03-03 06:03:13,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default5]:[2022-03-03 06:03:13,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default1]:[2022-03-03 06:03:13,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default2]:[2022-03-03 06:03:13,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default3]:[2022-03-03 06:03:13,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default2]:[2022-03-03 06:03:13,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default0]:[2022-03-03 06:03:13,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default7]:[2022-03-03 06:03:13,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default5]:[2022-03-03 06:03:13,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default3]:[2022-03-03 06:03:13,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default7]:[2022-03-03 06:03:13,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default5]:[2022-03-03 06:03:13,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default3]:[2022-03-03 06:03:13,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default3]:[2022-03-03 06:03:13,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default6]:[2022-03-03 06:03:13,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default6]:[2022-03-03 06:03:13,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default3]:[2022-03-03 06:03:13,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default6]:[2022-03-03 06:03:13,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default2]:[2022-03-03 06:03:13,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default5]:[2022-03-03 06:03:13,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default4]:[2022-03-03 06:03:13,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default4]:[2022-03-03 06:03:13,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default1]:[2022-03-03 06:03:13,884] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default4]:[2022-03-03 06:03:13,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default0]:[2022-03-03 06:03:13,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default2]:[2022-03-03 06:03:14,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default4]:[2022-03-03 06:03:14,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default0]:[2022-03-03 06:03:14,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default0]:[2022-03-03 06:03:14,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default0]:[2022-03-03 06:03:14,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default1]:[2022-03-03 06:03:14,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default0]:[2022-03-03 06:03:14,680] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default3]:[2022-03-03 06:03:14,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default0]:[2022-03-03 06:03:14,767] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default1]:[2022-03-03 06:03:14,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default3]:[2022-03-03 06:03:14,815] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default2]:[2022-03-03 06:03:14,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default6]:[2022-03-03 06:03:14,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default1]:[2022-03-03 06:03:14,789] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default2]:[2022-03-03 06:03:14,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default1]:[2022-03-03 06:03:14,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default2]:[2022-03-03 06:03:14,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default4]:[2022-03-03 06:03:14,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default4]:[2022-03-03 06:03:14,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default0]:[2022-03-03 06:03:14,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default4]:[2022-03-03 06:03:14,894] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default1]:[2022-03-03 06:03:14,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default5]:[2022-03-03 06:03:14,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default0]:[2022-03-03 06:03:14,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default3]:[2022-03-03 06:03:14,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default5]:[2022-03-03 06:03:15,018] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default0]:[2022-03-03 06:03:15,018] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default1]:[2022-03-03 06:03:14,981] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default1]:[2022-03-03 06:03:15,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default5]:[2022-03-03 06:03:15,103] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default0]:[2022-03-03 06:03:15,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default1]:[2022-03-03 06:03:15,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default5]:[2022-03-03 06:03:15,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default2]:[2022-03-03 06:03:15,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default7]:[2022-03-03 06:03:15,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default1]:[2022-03-03 06:03:15,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default7]:[2022-03-03 06:03:15,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default2]:[2022-03-03 06:03:15,345] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default0]:[2022-03-03 06:03:15,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default7]:[2022-03-03 06:03:15,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default6]:[2022-03-03 06:03:15,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default7]:[2022-03-03 06:03:15,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default3]:[2022-03-03 06:03:15,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default5]:[2022-03-03 06:03:15,423] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default5]:[2022-03-03 06:03:15,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default5]:[2022-03-03 06:03:15,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default3]:[2022-03-03 06:03:15,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default2]:[2022-03-03 06:03:15,495] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default4]:[2022-03-03 06:03:15,483] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default5]:[2022-03-03 06:03:15,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default1]:[2022-03-03 06:03:15,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default5]:[2022-03-03 06:03:15,520] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default4]:[2022-03-03 06:03:15,555] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default3]:[2022-03-03 06:03:15,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default7]:[2022-03-03 06:03:15,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default7]:[2022-03-03 06:03:15,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default6]:[2022-03-03 06:03:15,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default7]:[2022-03-03 06:03:15,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default2]:[2022-03-03 06:03:15,673] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default7]:[2022-03-03 06:03:15,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default0]:[2022-03-03 06:03:15,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default4]:[2022-03-03 06:03:15,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default5]:[2022-03-03 06:03:15,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default1]:[2022-03-03 06:03:15,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default3]:[2022-03-03 06:03:15,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default3]:[2022-03-03 06:03:15,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default4]:[2022-03-03 06:03:15,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default5]:[2022-03-03 06:03:15,904] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default5]:[2022-03-03 06:03:16,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default1]:[2022-03-03 06:03:15,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default6]:[2022-03-03 06:03:15,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default0]:[2022-03-03 06:03:15,942] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default1]:[2022-03-03 06:03:15,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default4]:[2022-03-03 06:03:15,971] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default2]:[2022-03-03 06:03:15,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default2]:[2022-03-03 06:03:16,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default4]:[2022-03-03 06:03:16,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default5]:[2022-03-03 06:03:16,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default7]:[2022-03-03 06:03:16,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default4]:[2022-03-03 06:03:16,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default4]:[2022-03-03 06:03:16,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default0]:[2022-03-03 06:03:16,169] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default4]:[2022-03-03 06:03:16,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default3]:[2022-03-03 06:03:16,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default0]:[2022-03-03 06:03:16,279] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default7]:[2022-03-03 06:03:16,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default5]:[2022-03-03 06:03:16,271] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default4]:[2022-03-03 06:03:16,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default3]:[2022-03-03 06:03:16,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default0]:[2022-03-03 06:03:16,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default4]:[2022-03-03 06:03:16,392] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default0]:[2022-03-03 06:03:16,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default1]:[2022-03-03 06:03:16,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default4]:[2022-03-03 06:03:16,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default5]:[2022-03-03 06:03:16,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default1]:[2022-03-03 06:03:16,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default6]:[2022-03-03 06:03:16,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default1]:[2022-03-03 06:03:16,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default6]:[2022-03-03 06:03:16,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default2]:[2022-03-03 06:03:16,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default2]:[2022-03-03 06:03:16,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default1]:[2022-03-03 06:03:16,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default6]:[2022-03-03 06:03:16,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default1]:[2022-03-03 06:03:16,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default6]:[2022-03-03 06:03:16,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default7]:[2022-03-03 06:03:16,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default0]:[2022-03-03 06:03:16,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default7]:[2022-03-03 06:03:16,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default3]:[2022-03-03 06:03:16,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default6]:[2022-03-03 06:03:17,015] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default4]:[2022-03-03 06:03:16,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default5]:[2022-03-03 06:03:17,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default2]:[2022-03-03 06:03:17,075] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default7]:[2022-03-03 06:03:17,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default3]:[2022-03-03 06:03:17,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default0]:[2022-03-03 06:03:17,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default3]:[2022-03-03 06:03:17,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default3]:[2022-03-03 06:03:17,232] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default7]:[2022-03-03 06:03:17,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default5]:[2022-03-03 06:03:17,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default4]:[2022-03-03 06:03:17,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default6]:[2022-03-03 06:03:17,357] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default6]:[2022-03-03 06:03:17,357] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default5]:[2022-03-03 06:03:17,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default2]:[2022-03-03 06:03:17,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default4]:[2022-03-03 06:03:17,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default0]:[2022-03-03 06:03:17,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default3]:[2022-03-03 06:03:17,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default0]:[2022-03-03 06:03:17,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default2]:[2022-03-03 06:03:17,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default0]:[2022-03-03 06:03:17,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default6]:[2022-03-03 06:03:17,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default3]:[2022-03-03 06:03:17,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default1]:[2022-03-03 06:03:17,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default7]:[2022-03-03 06:03:17,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default0]:[2022-03-03 06:03:17,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default6]:[2022-03-03 06:03:17,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default0]:[2022-03-03 06:03:17,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default6]:[2022-03-03 06:03:17,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default7]:[2022-03-03 06:03:17,879] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default7]:[2022-03-03 06:03:17,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default2]:[2022-03-03 06:03:17,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default3]:[2022-03-03 06:03:17,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default2]:[2022-03-03 06:03:18,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default2]:[2022-03-03 06:03:18,037] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default2]:[2022-03-03 06:03:18,077] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default1]:[2022-03-03 06:03:18,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default0]:[2022-03-03 06:03:18,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default4]:[2022-03-03 06:03:18,141] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default1]:[2022-03-03 06:03:18,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default1]:[2022-03-03 06:03:18,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default0]:[2022-03-03 06:03:18,202] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default1]:[2022-03-03 06:03:18,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default7]:[2022-03-03 06:03:18,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default3]:[2022-03-03 06:03:18,208] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default1]:[2022-03-03 06:03:18,291] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default0]:[2022-03-03 06:03:18,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default2]:[2022-03-03 06:03:18,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default0]:[2022-03-03 06:03:18,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default2]:[2022-03-03 06:03:18,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default3]:[2022-03-03 06:03:18,423] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default5]:[2022-03-03 06:03:18,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default7]:[2022-03-03 06:03:18,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default7]:[2022-03-03 06:03:18,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default4]:[2022-03-03 06:03:18,441] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default7]:[2022-03-03 06:03:18,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default6]:[2022-03-03 06:03:18,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default3]:[2022-03-03 06:03:18,560] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default2]:[2022-03-03 06:03:18,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default6]:[2022-03-03 06:03:18,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default4]:[2022-03-03 06:03:18,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default4]:[2022-03-03 06:03:18,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default2]:[2022-03-03 06:03:18,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default3]:[2022-03-03 06:03:18,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default3]:[2022-03-03 06:03:18,600] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default7]:[2022-03-03 06:03:18,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default3]:[2022-03-03 06:03:18,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default7]:[2022-03-03 06:03:18,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default1]:[2022-03-03 06:03:18,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default0]:[2022-03-03 06:03:18,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default5]:[2022-03-03 06:03:18,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default6]:[2022-03-03 06:03:18,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default5]:[2022-03-03 06:03:18,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default6]:[2022-03-03 06:03:18,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default1]:[2022-03-03 06:03:18,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default0]:[2022-03-03 06:03:18,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default3]:[2022-03-03 06:03:18,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default5]:[2022-03-03 06:03:18,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default2]:[2022-03-03 06:03:18,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default7]:[2022-03-03 06:03:18,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default6]:[2022-03-03 06:03:18,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default1]:[2022-03-03 06:03:18,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default5]:[2022-03-03 06:03:19,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default1]:[2022-03-03 06:03:19,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default6]:[2022-03-03 06:03:19,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default4]:[2022-03-03 06:03:19,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default5]:[2022-03-03 06:03:19,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default1]:[2022-03-03 06:03:19,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default0]:[2022-03-03 06:03:19,225] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default3]:[2022-03-03 06:03:19,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default4]:[2022-03-03 06:03:19,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default5]:[2022-03-03 06:03:19,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default4]:[2022-03-03 06:03:19,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default4]:[2022-03-03 06:03:19,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default2]:[2022-03-03 06:03:19,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default6]:[2022-03-03 06:03:19,616] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default3]:[2022-03-03 06:03:19,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default7]:[2022-03-03 06:03:19,571] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default2]:[2022-03-03 06:03:19,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default2]:[2022-03-03 06:03:19,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default4]:[2022-03-03 06:03:19,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default0]:[2022-03-03 06:03:19,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default5]:[2022-03-03 06:03:20,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default2]:[2022-03-03 06:03:20,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default4]:[2022-03-03 06:03:20,099] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default3]:[2022-03-03 06:03:20,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default5]:[2022-03-03 06:03:20,178] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default5]:[2022-03-03 06:03:20,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default0]:[2022-03-03 06:03:20,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default2]:[2022-03-03 06:03:20,260] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default0]:[2022-03-03 06:03:20,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default1]:[2022-03-03 06:03:20,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default7]:[2022-03-03 06:03:20,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default7]:[2022-03-03 06:03:20,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default6]:[2022-03-03 06:03:20,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default3]:[2022-03-03 06:03:20,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default1]:[2022-03-03 06:03:20,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default6]:[2022-03-03 06:03:20,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default3]:[2022-03-03 06:03:20,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default0]:[2022-03-03 06:03:20,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default1]:[2022-03-03 06:03:20,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default2]:[2022-03-03 06:03:20,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default3]:[2022-03-03 06:03:21,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default2]:[2022-03-03 06:03:21,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default6]:[2022-03-03 06:03:21,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default4]:[2022-03-03 06:03:21,419] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default0]:[2022-03-03 06:03:21,397] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default5]:[2022-03-03 06:03:21,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default0]:[2022-03-03 06:03:21,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default2]:[2022-03-03 06:03:21,497] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default5]:[2022-03-03 06:03:21,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default4]:[2022-03-03 06:03:21,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default4]:[2022-03-03 06:03:21,628] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default6]:[2022-03-03 06:03:21,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default1]:[2022-03-03 06:03:21,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default7]:[2022-03-03 06:03:21,753] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default5]:[2022-03-03 06:03:21,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default7]:[2022-03-03 06:03:21,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default5]:[2022-03-03 06:03:21,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default6]:[2022-03-03 06:03:21,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default7]:[2022-03-03 06:03:21,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default6]:[2022-03-03 06:03:22,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default6]:[2022-03-03 06:03:22,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default7]:[2022-03-03 06:03:22,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default4]:[2022-03-03 06:03:22,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default5]:[2022-03-03 06:03:22,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default4]:[2022-03-03 06:03:23,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default0]:[2022-03-03 06:03:23,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default1]:[2022-03-03 06:03:23,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default3]:[2022-03-03 06:03:23,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default1]:[2022-03-03 06:03:23,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default0]:[2022-03-03 06:03:23,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default6]:[2022-03-03 06:03:24,063] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default7]:[2022-03-03 06:03:24,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default6]:[2022-03-03 06:03:26,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default7]:[2022-03-03 06:03:26,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default7]:[2022-03-03 06:03:27,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default0]: successfully saved checkpoint at iteration 50 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default6]:[2022-03-03 06:03:27,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step50/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default7]:time (ms) | save-checkpoint: 40699.24 [default7]: iteration 51/ 128728 | consumed samples: 816 | consumed tokens: 1671168 | elapsed time per iteration (s): 55.58 | learning rate: 2.674E-07 | global batch size: 16 | lm loss: 1.196870E+01 | grad norm: 2.511 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.288 | TFLOPs: 2.20 | [default7]: iteration 52/ 128728 | consumed samples: 832 | consumed tokens: 1703936 | elapsed time per iteration (s): 14.83 | learning rate: 2.726E-07 | global batch size: 16 | lm loss: 1.192159E+01 | grad norm: 2.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.079 | TFLOPs: 8.26 | WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178153 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178154 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178261
closing signal SIGTERM srun: Job step aborted: Waiting up to 62 seconds for job step to finish. WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160808 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178155 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182110 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225345 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178262 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194919 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225346 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178301 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202752 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160809 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206323 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178156 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182111 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194920 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225347 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178302 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202753 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199596 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178263 closing signal SIGTERM 
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160810 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206324 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199185 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194807 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178157 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182112 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194921 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225348 closing signal SIGTERM slurmstepd: error: *** STEP 176449.0 ON jean-zay-iam01 CANCELLED AT 2022-03-03T06:04:05 *** WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199757 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178683 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198618 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178303 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205449 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203096 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219084 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38228 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181690 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40089 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79441 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202754 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198711 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199597 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205490 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204987 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178264 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151809 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197446 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197516 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197071 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197108 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197562 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198050 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160811 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206325 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199186 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194808 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178158 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182113 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194922 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225349 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199758 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178684 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198619 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178304 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205450 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203097 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219085 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38229 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181691 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40090 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79442 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202755 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198712 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199598 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205491 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204988 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178265 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151810 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197447 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205487 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58953 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197517 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197072 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197109 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197563 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198051 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202191 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204306 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206326 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199187 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194809 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178159 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194923 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225350 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199759 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178685 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198620 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178305 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205451 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203098 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219086 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38230 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181692 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40091 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79443 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202756 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199599 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205492 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178266 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151811 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205488 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58954 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197518 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197073 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197110 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197564 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198052 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202192 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204307 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206327 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199188 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194810 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178161 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194924 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225351 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199760 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178686 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198621 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178306 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205452 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181693 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40092 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79444 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202757 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199600 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205493 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204989 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58955 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197519 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197565 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198053 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160812 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202193 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204308 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206328 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199189 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194811 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194925 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225352 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199761 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178687 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198622 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178307 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205453 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40093 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79445 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202758 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199601 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205494 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178267 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151812 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197520 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197074 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198054 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160813 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202194 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204309 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206329 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199190 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194812 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194926 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199762 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178688 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198623 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178308 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205454 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203099 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219087 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38231 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181694 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40094 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79446 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202759 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199602 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205495 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178268 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197448 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205489 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197521 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197075 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197111 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198055 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160814 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202195 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204310 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 206330 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199191 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194813 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182114 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199763 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178689 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198624 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205455 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219088 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181695 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40095 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79447 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198713 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199603 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205496 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204990 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205490 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58956 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197522 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197076 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197566 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198056 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 160815 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204311 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199192 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 194814 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 199764 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 178690 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198625 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205456 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219089 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181696 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79448 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205497 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205491 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197523 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197077 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197567 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198057 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202196 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204312 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182115 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203100 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219090 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38232 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 181697 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 40096 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204991 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197449 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205492 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58957 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197078 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197112 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197568 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202197 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204313 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209280 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203101 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 219091 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198714 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204992 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58958 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197113 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191841 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197569 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202198 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244854 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209281 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205493 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58959 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209282 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209283 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209284 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209285 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204993 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209286 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197114 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209287 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58960 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38233 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 205494 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151813 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191842 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244855 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 204994 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197115 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197450 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197451 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203102 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197452 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203103 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197454 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191843 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176305 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198715 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191844 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151814 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176306 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198716 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191845 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38234 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198717 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191846 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 198718 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226691 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226692 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191847 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226693 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226694 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226695 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202145 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226696 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226697 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226698 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38235 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192915 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197784 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151815 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201473 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 191848 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197785 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192916 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202146 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201474 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182116 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201475 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244856 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202147 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201476 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 182117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192917 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201477 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197786 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192918 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201478 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192919 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202413 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201479 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192920 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244857 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 151816 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 201480 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197787 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244858 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176309 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 192921 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203947 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244859 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202148 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202414 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176310 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197788 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203948 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202149 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244860 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197789 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197790 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244861 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 197791 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 176312 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203949 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202415 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202416 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203950 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202150 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202417 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203951 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202151 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202418 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203952 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202152 closing signal SIGTERM
WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188076 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203953 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202419 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 203954 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188077 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188078 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188079 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188080 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188081 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188082 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188083 closing signal SIGTERM
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[default7]:> setting tensorboard ...
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]: abort_on_unmet_fused_kernel_constraints ......... True
[default0]: accumulate_allreduce_grads_in_fp32 .............. True
[default0]: adam_beta1 ...................................... 0.9
[default0]: adam_beta2 ...................................... 0.95
[default0]: adam_eps ........................................ 1e-08
[default0]: adlr_autoresume ................................. False
[default0]: adlr_autoresume_interval ........................ 1000
[default0]: apply_query_key_layer_scaling ................... True
[default0]: apply_residual_connection_post_layernorm ........ False
[default0]: attention_dropout ............................... 0.1
[default0]: attention_softmax_in_fp32 ....................... False
[default0]: bert_binary_head ................................ True
[default0]: bert_load ....................................... None
[default0]: bf16 ............................................ True
[default0]: bias_dropout_fusion ............................. True
[default0]: bias_gelu_fusion ................................ True
[default0]: biencoder_projection_dim ........................ 0
[default0]: biencoder_shared_query_context_model ............ False
[default0]: block_data_path ................................. None
[default0]: checkpoint_activations .......................... True
[default0]: checkpoint_in_cpu ............................... False
[default0]: checkpoint_num_layers ........................... 1
[default0]: clip_grad ....................................... 1.0
[default0]: codecarbon_dir .................................. None
[default0]: consumed_train_samples .......................... 0
[default0]: consumed_train_tokens ........................... 0
[default0]: consumed_valid_samples .......................... 0
[default0]: contigious_checkpointing ........................ False
[default0]: cpu_optimizer ................................... False
[default0]: cpu_torch_adam .................................. False
[default0]: curriculum_learning ............................. False
[default0]: data_impl ....................................... mmap
[default0]: data_parallel_size .............................. 8
[default0]: data_path ....................................... None
[default0]: dataloader_type ................................. single
[default0]: DDP_impl ........................................ local
[default0]: decoder_seq_length .............................. None
[default0]: deepscale ....................................... False
[default0]: deepscale_config ................................ None
[default0]: deepspeed ....................................... True
[default0]: deepspeed_activation_checkpointing .............. True
[default0]: deepspeed_config ................................ ./ds_config.176547.json
[default0]: deepspeed_mpi ................................... False
[default0]: distribute_checkpointed_activations ............. False
[default0]: distributed_backend ............................. nccl
[default0]: embed_layernorm ................................. True
[default0]: embedding_path .................................. None
[default0]: encoder_seq_length .............................. 2048
[default0]: eod_mask_loss ................................... False
[default0]: eval_interval ................................... 1000
[default0]: eval_iters ...................................... 10
[default0]: eval_only ....................................... None
[default0]: evidence_data_path .............................. None
[default0]: exit_duration_in_mins ........................... 1190
[default0]: exit_interval ................................... None
[default0]: ffn_hidden_size ................................. 57344
[default0]: finetune ........................................ False
[default0]: fp16 ............................................ False
[default0]: fp16_lm_cross_entropy ........................... False
[default0]: fp32_residual_connection ........................ False
[default0]: gigaflos_no_embeds .............................. 0
[default0]: global_batch_size ............................... 2048
[default0]: glu_activation .................................. None
[default0]: hidden_dropout .................................. 0.1
[default0]: hidden_size ..................................... 14336
[default0]: hysteresis ...................................... 2
[default0]: ict_head_size ................................... None
[default0]: ict_load ........................................ None
[default0]: img_dim ......................................... 224
[default0]: indexer_batch_size .............................. 128
[default0]: indexer_log_interval ............................ 1000
[default0]: init_method_std ................................. 0.0048
[default0]: init_method_xavier_uniform ...................... False
[default0]: initial_loss_scale .............................. 4294967296
[default0]: kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]: kv_channels ..................................... 128
[default0]: layernorm_epsilon ............................... 1e-05
[default0]: lazy_mpu_init ................................... None
[default0]: load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]: local_rank ...................................... None
[default0]: log_batch_size_to_tensorboard ................... True
[default0]: log_interval .................................... 1
[default0]: log_learning_rate_to_tensorboard ................ True
[default0]: log_level ....................................... None
[default0]: log_level_replica ............................... None
[default0]: log_loss_scale_to_tensorboard ................... True
[default0]: log_num_zeros_in_grad ........................... False
[default0]: log_params_norm ................................. False
[default0]: log_path ........................................ None
[default0]: log_timers_to_tensorboard ....................... True
[default0]: log_validation_ppl_to_tensorboard ............... True
[default0]: loss_on_targets_only ............................ False
[default0]: loss_scale ...................................... None
[default0]: loss_scale_window ............................... 1000
[default0]: lr .............................................. 6e-05
[default0]: lr_decay_iters .................................. None
[default0]: lr_decay_samples ................................ 200000000
[default0]: lr_decay_style .................................. cosine
[default0]: lr_decay_tokens ................................. None
[default0]: lr_warmup_fraction .............................. None
[default0]: lr_warmup_iters ................................. 0
[default0]: lr_warmup_samples ............................... 183105
[default0]: make_vocab_size_divisible_by .................... 128
[default0]: mask_prob ....................................... 0.15
[default0]: masked_softmax_fusion ........................... True
[default0]: max_position_embeddings ......................... 2048
[default0]: memory_centric_tiled_linear ..................... False
[default0]: merge_file ...................................... None
[default0]: micro_batch_size ................................ 2
[default0]: min_loss_scale .................................. 1.0
[default0]: min_lr .......................................... 6e-06
[default0]: mmap_warmup ..................................... False
[default0]: no_load_optim ................................... None
[default0]: no_load_rng ..................................... None
[default0]: no_save_optim ................................... None
[default0]: no_save_rng ..................................... None
[default0]: num_attention_heads ............................. 112
[default0]: num_channels .................................... 3
[default0]: num_classes ..................................... 1000
[default0]: num_layers ...................................... 70
[default0]: num_layers_per_virtual_pipeline_stage ........... None
[default0]: num_workers ..................................... 2
[default0]: onnx_safe ....................................... None
[default0]: openai_gelu ..................................... False
[default0]: optimizer ....................................... adam
[default0]: override_lr_scheduler ........................... False
[default0]: pad_vocab_size_to ............................... 250880
[default0]: params_dtype .................................... torch.bfloat16
[default0]: partition_activations ........................... False
[default0]: patch_dim ....................................... 16
[default0]: pipeline_model_parallel_size .................... 12
[default0]: position_embedding_type ......................... PositionEmbeddingType.alibi
[default0]: pp_partition_method ............................. type:transformer|embedding
[default0]: profile_backward ................................ False
[default0]: query_in_block_prob ............................. 0.1
[default0]: rampup_batch_size ............................... ['16', '16', '9_765_625']
[default0]: rank ............................................ 0
[default0]: remote_device ................................... none
[default0]: reset_attention_mask ............................ False
[default0]: reset_position_ids .............................. False
[default0]: retriever_report_topk_accuracies ................ []
[default0]: retriever_score_scaling ......................... False
[default0]: retriever_seq_length ............................ 256
[default0]: reweight_loss_based_on_position_frequency ....... False
[default0]: sample_rate ..................................... 1.0
[default0]: save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default0]: save_interval ................................... 500
[default0]: scatter_gather_tensors_in_pipeline .............. True
[default0]: scattered_embeddings ............................ False
[default0]: seed ............................................ 42
[default0]: seq_length ...................................... 2048
[default0]: sgd_momentum .................................... 0.9
[default0]: short_seq_prob .................................. 0.1
[default0]: skip_train_iteration_range ...................... None
[default0]: split ........................................... None
[default0]: split_transformers .............................. False
[default0]: synchronize_each_layer .......................... False
[default0]: tensor_model_parallel_size ...................... 4
[default0]: tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard
[default0]: tensorboard_log_interval ........................ 1
[default0]: tensorboard_queue_size .......................... 5
[default0]: test_weighted_split_names ....................... ['test']
[default0]: test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']]
[default0]: test_weighted_split_paths_path .................. None
[default0]: test_weighted_split_splits ......................
[['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']] [default0]: test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: tile_factor ..................................... 1 [default0]: titles_data_path ................................ None [default0]: tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k [default0]: tokenizer_type .................................. PretrainedFromHF [default0]: train_iters ..................................... None [default0]: train_samples ................................... 220000000 [default0]: train_tokens .................................... None [default0]: train_weighted_split_names ...................... ['train'] [default0]: train_weighted_split_paths ...................... 
[['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: train_weighted_split_paths_path ................. None [default0]: train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']] [default0]: train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: use_bnb_optimizer ............................... False [default0]: use_checkpoint_lr_scheduler ..................... False [default0]: use_contiguous_buffers_in_ddp ................... 
True [default0]: use_cpu_initialization .......................... None [default0]: use_one_sent_docs ............................... False [default0]: use_pin_memory .................................. False [default0]: valid_weighted_split_names ...................... ['valid'] [default0]: valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: valid_weighted_split_paths_path ................. None [default0]: valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']] [default0]: valid_weighted_split_weights .................... 
[['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: virtual_pipeline_model_parallel_size ............ None [default0]: vocab_extra_ids ................................. 0 [default0]: vocab_file ...................................... None [default0]: weight_decay .................................... 0.1 [default0]: world_size ...................................... 384 [default0]: zero_allgather_bucket_size ...................... 0.0 [default0]: zero_contigious_gradients ....................... False [default0]: zero_reduce_bucket_size ......................... 0.0 [default0]: zero_reduce_scatter ............................. False [default0]: zero_stage ...................................... 0 [default0]:-------------------- end of arguments --------------------- [default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples. [default0]:> building PretrainedFromHF tokenizer ... [default0]: vocab file is un-used. loading tokenizer from pre-trained model [default0]:Offline mode: forcing local_files_only=True [default0]:Offline mode: forcing local_files_only=True [default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate. 
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40 [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e [default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880) [default0]:DeepSpeed general environment info: [default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch'] [default0]:torch version .................... 1.11.0+cu115 [default0]:torch cuda version ............... 11.5 [default0]:nvcc version ..................... 11.4 [default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed'] [default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates [default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5 [default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm **** [default0]:> initializing torch distributed ... [default0]:> initializing tensor model parallel with size 4 [default0]:> initializing pipeline model parallel with size 12 [default0]:> setting random seeds to 42 ... 
[default0]:[2022-03-03 06:05:07,541] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42 [default0]:> compiling dataset index builder ... [default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data' [default0]:make: Nothing to be done for 'default'. [default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data' [default0]:>>> done with dataset index builder. Compilation time: 0.100 seconds [default0]:> compiling and loading fused kernels ... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module scaled_upper_triang_masked_softmax_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module scaled_masked_softmax_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module scaled_masked_softmax_cuda... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... 
[default0]:Building extension module fused_mix_prec_layer_norm_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module fused_mix_prec_layer_norm_cuda... [default0]:>>> done with compiling and loading fused kernels. Compilation time: 8.454 seconds [default0]:time to initialize megatron (seconds): 12.097 [default0]:[after megatron is initialized] datetime: 2022-03-03 06:05:16 [default0]:building GPT model ... [default0]:[2022-03-03 06:05:16,133] [INFO] [utils.py:828:see_memory_usage] Before Building Model [default0]:[2022-03-03 06:05:16,134] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB [default0]:[2022-03-03 06:05:16,134] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.19 GB, percent = 8.6% [default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None [default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, 
model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, 
ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, 
ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, 
ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, 
ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, 
ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, 
ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 
340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383} [default0]:[2022-03-03 06:05:18,115] [INFO] 
[module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding
[default0]:stage=0 layers=8
[default0]: 0: _to_float16
[default0]: 1: EmbeddingPipe
[default0]: 2: <lambda>
[default0]: 3: ParallelTransformerLayerPipe
[default0]: 4: ParallelTransformerLayerPipe
[default0]: 5: ParallelTransformerLayerPipe
[default0]: 6: ParallelTransformerLayerPipe
[default0]: 7: ParallelTransformerLayerPipe
[default0]:stage=1 layers=6
[default0]: 8: ParallelTransformerLayerPipe
[default0]: 9: ParallelTransformerLayerPipe
[default0]: 10: ParallelTransformerLayerPipe
[default0]: 11: ParallelTransformerLayerPipe
[default0]: 12: ParallelTransformerLayerPipe
[default0]: 13: ParallelTransformerLayerPipe
[default0]:stage=2 layers=6
[default0]: 14: ParallelTransformerLayerPipe
[default0]: 15: ParallelTransformerLayerPipe
[default0]: 16: ParallelTransformerLayerPipe
[default0]: 17: ParallelTransformerLayerPipe
[default0]: 18: ParallelTransformerLayerPipe
[default0]: 19: ParallelTransformerLayerPipe
[default0]:stage=3 layers=6
[default0]: 20: ParallelTransformerLayerPipe
[default0]: 21: ParallelTransformerLayerPipe
[default0]: 22: ParallelTransformerLayerPipe
[default0]: 23: ParallelTransformerLayerPipe
[default0]: 24: ParallelTransformerLayerPipe
[default0]: 25: ParallelTransformerLayerPipe
[default0]:stage=4 layers=6
[default0]: 26: ParallelTransformerLayerPipe
[default0]: 27: ParallelTransformerLayerPipe
[default0]: 28: ParallelTransformerLayerPipe
[default0]: 29: ParallelTransformerLayerPipe
[default0]: 30: ParallelTransformerLayerPipe
[default0]: 31: ParallelTransformerLayerPipe
[default0]:stage=5 layers=6
[default0]: 32: ParallelTransformerLayerPipe
[default0]: 33: ParallelTransformerLayerPipe
[default0]: 34: ParallelTransformerLayerPipe
[default0]: 35: ParallelTransformerLayerPipe
[default0]: 36: ParallelTransformerLayerPipe
[default0]: 37: ParallelTransformerLayerPipe
[default0]:stage=6 layers=6
[default0]: 38: ParallelTransformerLayerPipe
[default0]: 39: ParallelTransformerLayerPipe
[default0]: 40: ParallelTransformerLayerPipe
[default0]: 41: ParallelTransformerLayerPipe
[default0]: 42: ParallelTransformerLayerPipe
[default0]: 43: ParallelTransformerLayerPipe
[default0]:stage=7 layers=6
[default0]: 44: ParallelTransformerLayerPipe
[default0]: 45: ParallelTransformerLayerPipe
[default0]: 46: ParallelTransformerLayerPipe
[default0]: 47: ParallelTransformerLayerPipe
[default0]: 48: ParallelTransformerLayerPipe
[default0]: 49: ParallelTransformerLayerPipe
[default0]:stage=8 layers=6
[default0]: 50: ParallelTransformerLayerPipe
[default0]: 51: ParallelTransformerLayerPipe
[default0]: 52: ParallelTransformerLayerPipe
[default0]: 53: ParallelTransformerLayerPipe
[default0]: 54: ParallelTransformerLayerPipe
[default0]: 55: ParallelTransformerLayerPipe
[default0]:stage=9 layers=6
[default0]: 56: ParallelTransformerLayerPipe
[default0]: 57: ParallelTransformerLayerPipe
[default0]: 58: ParallelTransformerLayerPipe
[default0]: 59: ParallelTransformerLayerPipe
[default0]: 60: ParallelTransformerLayerPipe
[default0]: 61: ParallelTransformerLayerPipe
[default0]:stage=10 layers=6
[default0]: 62: ParallelTransformerLayerPipe
[default0]: 63: ParallelTransformerLayerPipe
[default0]: 64: ParallelTransformerLayerPipe
[default0]: 65: ParallelTransformerLayerPipe
[default0]: 66: ParallelTransformerLayerPipe
[default0]: 67: ParallelTransformerLayerPipe
[default0]:stage=11 layers=9
[default0]: 68: ParallelTransformerLayerPipe
[default0]: 69: ParallelTransformerLayerPipe
[default0]: 70: ParallelTransformerLayerPipe
[default0]: 71: ParallelTransformerLayerPipe
[default0]: 72: ParallelTransformerLayerPipe
[default0]: 73: <lambda>
[default0]: 74: MixedFusedLayerNorm
[default0]: 75: EmbeddingPipe
[default0]: 76: float16_to_fp32
[default0]: loss: CrossEntropy
[default0]:[2022-03-03 06:05:19,292] [INFO] [utils.py:828:see_memory_usage] After Building Model
[default0]:[2022-03-03 06:05:19,293] [INFO] [utils.py:829:see_memory_usage] 
MA 7.43 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-03 06:05:19,293] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.6 GB, percent = 8.7% [default0]:setting training iterations to 128728 [default0]:> learning rate decay style: cosine [default0]:DeepSpeed is enabled. [default0]:[2022-03-03 06:05:19,315] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates [default0]:[2022-03-03 06:05:21,109] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False [default0]:[2022-03-03 06:05:21,109] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer [default0]:[2022-03-03 06:05:21,109] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer [default0]:[2022-03-03 06:05:21,110] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam [default0]:[2022-03-03 06:05:21,110] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer [default0]:[2022-03-03 06:05:21,137] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer [default0]:[2022-03-03 06:05:21,138] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-03 06:05:21,138] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,159] [INFO] [utils.py:828:see_memory_usage] before initializing group 0 [default0]:[2022-03-03 06:05:21,160] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.42 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-03 06:05:21,160] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,227] [INFO] [utils.py:828:see_memory_usage] after initializing group 0 [default0]:[2022-03-03 06:05:21,228] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 
17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-03 06:05:21,228] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,248] [INFO] [utils.py:828:see_memory_usage] before initializing group 1 [default0]:[2022-03-03 06:05:21,248] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-03 06:05:21,248] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,290] [INFO] [utils.py:828:see_memory_usage] after initializing group 1 [default0]:[2022-03-03 06:05:21,291] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 06:05:21,291] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,310] [INFO] [utils.py:828:see_memory_usage] before initializing group 2 [default0]:[2022-03-03 06:05:21,311] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 06:05:21,311] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,331] [INFO] [utils.py:828:see_memory_usage] after initializing group 2 [default0]:[2022-03-03 06:05:21,332] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 06:05:21,332] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,352] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer [default0]:[2022-03-03 06:05:21,352] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-03 06:05:21,352] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,398] 
[INFO] [utils.py:828:see_memory_usage] end initialize_optimizer [default0]:[2022-03-03 06:05:21,398] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-03 06:05:21,398] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,417] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer [default0]:[2022-03-03 06:05:21,418] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-03 06:05:21,418] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.95 GB, percent = 8.7% [default0]:[2022-03-03 06:05:21,418] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [default0]:[2022-03-03 06:05:21,418] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler [default0]:[2022-03-03 06:05:21,418] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x147227cac8b0> [default0]:[2022-03-03 06:05:21,418] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)] [default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1057:print] DeepSpeedEngine configuration: [default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print] activation_checkpointing_config { [default0]: "partition_activations": false, [default0]: "contiguous_memory_optimization": false, [default0]: "cpu_checkpointing": false, [default0]: "number_checkpoints": null, [default0]: "synchronize_checkpoint_boundary": false, [default0]: "profile": false [default0]:} [default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print] aio_config ................... 
{'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print] amp_enabled .................. False [default0]:[2022-03-03 06:05:21,418] [INFO] [config.py:1061:print] amp_params ................... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] autotuning_config ............ { [default0]: "enabled": false, [default0]: "start_step": null, [default0]: "end_step": null, [default0]: "metric_path": null, [default0]: "arg_mappings": null, [default0]: "metric": "throughput", [default0]: "model_info": null, [default0]: "results_dir": null, [default0]: "exps_dir": null, [default0]: "overwrite": true, [default0]: "fast": true, [default0]: "start_profile_step": 3, [default0]: "end_profile_step": 5, [default0]: "tuner_type": "gridsearch", [default0]: "tuner_early_stopping": 5, [default0]: "tuner_num_trials": 50, [default0]: "model_info_path": null, [default0]: "mp_size": 1, [default0]: "max_train_batch_size": null, [default0]: "min_train_batch_size": 1, [default0]: "max_train_micro_batch_size_per_gpu": 1.024000e+03, [default0]: "min_train_micro_batch_size_per_gpu": 1, [default0]: "num_tuning_micro_batch_sizes": 3 [default0]:} [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] bfloat16_enabled ............. True [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] checkpoint_tag_validation_enabled True [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] checkpoint_tag_validation_fail False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] communication_data_type ...... None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] curriculum_enabled ........... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] curriculum_params ............ False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] dataloader_drop_last ......... 
False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] disable_allgather ............ False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] dump_state ................... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] dynamic_loss_scale_args ...... None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_enabled ........... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_gas_boundary_resolution 1 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_layer_name ........ bert.encoder.layer [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_layer_num ......... 0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_max_iter .......... 100 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_stability ......... 1e-06 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_tol ............... 0.01 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] eigenvalue_verbose ........... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] elasticity_enabled ........... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] flops_profiler_config ........ { [default0]: "enabled": false, [default0]: "profile_step": 1, [default0]: "module_depth": -1, [default0]: "top_modules": 1, [default0]: "detailed": true, [default0]: "output_file": null [default0]:} [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] fp16_enabled ................. False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] fp16_master_weights_and_gradients False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] fp16_mixed_quantize .......... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] global_rank .................. 
0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] gradient_accumulation_steps .. 128 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] gradient_clipping ............ 1.0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] gradient_predivide_factor .... 1.0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] initial_dynamic_scale ........ 1 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] loss_scale ................... 1.0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] memory_breakdown ............. False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] optimizer_legacy_fusion ...... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] optimizer_name ............... None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] optimizer_params ............. None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] pld_enabled .................. False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] pld_params ................... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] prescale_gradients ........... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_change_rate ......... 0.001 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_groups .............. 1 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_offset .............. 1000 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_period .............. 1000 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_rounding ............ 
0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_start_bits .......... 16 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_target_bits ......... 8 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_training_enabled .... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_type ................ 0 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] quantize_verbose ............. False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] scheduler_name ............... None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] scheduler_params ............. None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] sparse_attention ............. None [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] sparse_gradients_enabled ..... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] steps_per_print .............. 2000 [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] tensorboard_enabled .......... False [default0]:[2022-03-03 06:05:21,419] [INFO] [config.py:1061:print] tensorboard_job_name ......... DeepSpeedJobName [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] tensorboard_output_path ...... [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] train_batch_size ............. 2048 [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] train_micro_batch_size_per_gpu 2 [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] use_quantizer_kernel ......... False [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] wall_clock_breakdown ......... False [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] world_size ................... 
8 [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] zero_allow_untested_optimizer False [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] zero_config .................. { [default0]: "stage": 0, [default0]: "contiguous_gradients": true, [default0]: "reduce_scatter": true, [default0]: "reduce_bucket_size": 5.000000e+08, [default0]: "allgather_partitions": true, [default0]: "allgather_bucket_size": 5.000000e+08, [default0]: "overlap_comm": false, [default0]: "load_from_fp32_weights": true, [default0]: "elastic_checkpoint": false, [default0]: "offload_param": null, [default0]: "offload_optimizer": null, [default0]: "sub_group_size": 1.000000e+09, [default0]: "prefetch_bucket_size": 5.000000e+07, [default0]: "param_persistence_threshold": 1.000000e+05, [default0]: "max_live_parameters": 1.000000e+09, [default0]: "max_reuse_distance": 1.000000e+09, [default0]: "gather_16bit_weights_on_model_save": false, [default0]: "ignore_unused_parameters": true, [default0]: "round_robin_gradients": false, [default0]: "legacy_stage1": false [default0]:} [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] zero_enabled ................. False [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1061:print] zero_optimization_stage ...... 
0 [default0]:[2022-03-03 06:05:21,420] [INFO] [config.py:1063:print] json = { [default0]: "train_micro_batch_size_per_gpu": 2, [default0]: "train_batch_size": 2.048000e+03, [default0]: "gradient_clipping": 1.0, [default0]: "zero_optimization": { [default0]: "stage": 0 [default0]: }, [default0]: "bf16": { [default0]: "enabled": true [default0]: }, [default0]: "steps_per_print": 2.000000e+03, [default0]: "wall_clock_breakdown": false [default0]:} [default0]:[2022-03-03 06:05:21,420] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2 [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) 
STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) 
STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-03 06:05:23,498] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]: > using checkpoint value 6e-05 for learning rate [default0]: > using checkpoint value 6e-06 for minimum learning rate [default0]: > using checkpoint value 183105 for warmup iterations [default0]: > using checkpoint value 200000000 for total number of iterations [default0]: > using checkpoint value cosine for decay style [default4]:[2022-03-03 06:05:39,293] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 332 [default5]:[2022-03-03 06:05:39,708] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 181 [default0]:[2022-03-03 06:05:39,888] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 72 [default0]:[2022-03-03 06:05:39,945] 
[INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 176 [default1]:[2022-03-03 06:05:40,144] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 177 [default4]:[2022-03-03 06:05:40,155] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 332 [default7]:[2022-03-03 06:05:40,361] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 335 [default0]:[2022-03-03 06:05:40,467] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 328 [default1]:[2022-03-03 06:05:40,508] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 329 [default4]:[2022-03-03 06:05:40,621] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 180 [default0]:[2022-03-03 06:05:40,644] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 352 [default0]:[2022-03-03 06:05:40,665] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 184 [default0]:[2022-03-03 06:05:40,755] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 72 [default5]:[2022-03-03 06:05:40,689] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 181 [default2]:[2022-03-03 06:05:40,717] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 330 [default3]:[2022-03-03 06:05:40,800] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 331 [default0]:[2022-03-03 06:05:41,187] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 176 [default6]:[2022-03-03 06:05:41,255] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 78 
[default1]:[2022-03-03 06:05:41,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 177 [default7]:[2022-03-03 06:05:41,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 335 [default7]:[2022-03-03 06:05:41,392] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 351 [default4]:[2022-03-03 06:05:41,446] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 284 [default0]:[2022-03-03 06:05:41,402] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 328 [default6]:[2022-03-03 06:05:41,553] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 182 [default0]:[2022-03-03 06:05:41,495] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 184 [default1]:[2022-03-03 06:05:41,501] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 329 [default0]:[2022-03-03 06:05:41,574] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 352 [default0]:[2022-03-03 06:05:41,623] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 344 [default0]:[2022-03-03 06:05:41,650] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 168 [default4]:[2022-03-03 06:05:41,697] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 180 [default4]:[2022-03-03 06:05:41,727] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 364 [default4]:[2022-03-03 06:05:41,745] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 356 [default5]:[2022-03-03 06:05:41,699] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 
ZeRO state_dicts for rank 333 [default4]:[2022-03-03 06:05:41,794] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 36 [default6]:[2022-03-03 06:05:41,843] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 334 [default0]:[2022-03-03 06:05:41,896] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 336 [default3]:[2022-03-03 06:05:41,921] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 339 [default4]:[2022-03-03 06:05:41,891] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 188 [default2]:[2022-03-03 06:05:41,893] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 330 [default3]:[2022-03-03 06:05:41,950] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 331 [default2]:[2022-03-03 06:05:41,978] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 178 [default0]:[2022-03-03 06:05:42,142] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 80 [default6]:[2022-03-03 06:05:42,157] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 78 [default5]:[2022-03-03 06:05:42,141] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 173 [default0]:[2022-03-03 06:05:42,215] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 288 [default5]:[2022-03-03 06:05:42,228] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 77 [default3]:[2022-03-03 06:05:42,327] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 179 [default7]:[2022-03-03 06:05:42,330] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 343 [default7]:[2022-03-03 06:05:42,319] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 351 [default4]:[2022-03-03 06:05:42,305] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 284 [default6]:[2022-03-03 06:05:42,408] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 182 [default4]:[2022-03-03 06:05:42,428] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 76 [default0]:[2022-03-03 06:05:42,408] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 248 [default0]:[2022-03-03 06:05:42,383] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 120 [default2]:[2022-03-03 06:05:42,495] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 74 [default2]:[2022-03-03 06:05:42,565] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 250 [default0]:[2022-03-03 06:05:42,559] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 344 [default1]:[2022-03-03 06:05:42,504] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 33 [default4]:[2022-03-03 06:05:42,630] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 252 [default1]:[2022-03-03 06:05:42,602] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 73 [default7]:[2022-03-03 06:05:42,600] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 191 [default5]:[2022-03-03 06:05:42,640] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 349 
[default1]:[2022-03-03 06:05:42,661] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 305 [default4]:[2022-03-03 06:05:42,730] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 124 [default4]:[2022-03-03 06:05:42,716] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 340 [default3]:[2022-03-03 06:05:42,753] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 347 [default5]:[2022-03-03 06:05:42,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 189 [default4]:[2022-03-03 06:05:42,728] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 308 [default4]:[2022-03-03 06:05:42,689] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 356 [default4]:[2022-03-03 06:05:42,691] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 36 [default4]:[2022-03-03 06:05:42,768] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 188 [default0]:[2022-03-03 06:05:42,716] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 168 [default5]:[2022-03-03 06:05:42,698] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 333 [default7]:[2022-03-03 06:05:42,769] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 183 [default3]:[2022-03-03 06:05:42,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 339 [default4]:[2022-03-03 06:05:42,798] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 364 [default6]:[2022-03-03 06:05:42,865] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 334 [default0]:[2022-03-03 06:05:42,883] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 272 [default2]:[2022-03-03 06:05:42,933] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 178 [default3]:[2022-03-03 06:05:42,901] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 251 [default0]:[2022-03-03 06:05:42,980] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 336 [default0]:[2022-03-03 06:05:43,025] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 80 [default7]:[2022-03-03 06:05:42,988] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 39 [default3]:[2022-03-03 06:05:43,014] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 35 [default4]:[2022-03-03 06:05:43,057] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 300 [default4]:[2022-03-03 06:05:43,079] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 52 [default0]:[2022-03-03 06:05:43,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 288 [default0]:[2022-03-03 06:05:43,166] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 280 [default4]:[2022-03-03 06:05:43,118] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 348 [default0]:[2022-03-03 06:05:43,148] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 32 [default3]:[2022-03-03 06:05:43,205] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 179 [default1]:[2022-03-03 06:05:43,186] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 345 [default7]:[2022-03-03 06:05:43,248] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 311 [default6]:[2022-03-03 06:05:43,239] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 310 [default5]:[2022-03-03 06:05:43,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 173 [default6]:[2022-03-03 06:05:43,250] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 190 [default4]:[2022-03-03 06:05:43,346] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 84 [default7]:[2022-03-03 06:05:43,309] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 343 [default5]:[2022-03-03 06:05:43,297] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 77 [default7]:[2022-03-03 06:05:43,343] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 79 [default6]:[2022-03-03 06:05:43,328] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 254 [default0]:[2022-03-03 06:05:43,300] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 304 [default5]:[2022-03-03 06:05:43,351] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 37 [default4]:[2022-03-03 06:05:43,441] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 76 [default2]:[2022-03-03 06:05:43,423] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 74 [default6]:[2022-03-03 06:05:43,464] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 350 
[default0]:[2022-03-03 06:05:43,416] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 120 [default2]:[2022-03-03 06:05:43,516] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 282 [default7]:[2022-03-03 06:05:43,512] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 255 [default4]:[2022-03-03 06:05:43,523] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 228 [default1]:[2022-03-03 06:05:43,529] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 33 [default1]:[2022-03-03 06:05:43,549] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 305 [default3]:[2022-03-03 06:05:43,537] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 283 [default4]:[2022-03-03 06:05:43,547] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 372 [default0]:[2022-03-03 06:05:43,522] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 256 [default4]:[2022-03-03 06:05:43,597] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 276 [default7]:[2022-03-03 06:05:43,648] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 183 [default4]:[2022-03-03 06:05:43,656] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 124 [default4]:[2022-03-03 06:05:43,647] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 292 [default2]:[2022-03-03 06:05:43,643] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 186 [default6]:[2022-03-03 06:05:43,611] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully 
read 8 ZeRO state_dicts for rank 342 [default1]:[2022-03-03 06:05:43,613] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 73 [default0]:[2022-03-03 06:05:43,634] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 192 [default7]:[2022-03-03 06:05:43,583] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 191 [default5]:[2022-03-03 06:05:43,588] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 349 [default4]:[2022-03-03 06:05:43,605] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 324 [default1]:[2022-03-03 06:05:43,639] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 169 [default2]:[2022-03-03 06:05:43,652] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 202 [default4]:[2022-03-03 06:05:43,646] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 204 [default5]:[2022-03-03 06:05:43,584] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 53 [default4]:[2022-03-03 06:05:43,680] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 340 [default3]:[2022-03-03 06:05:43,692] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 347 [default5]:[2022-03-03 06:05:43,677] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 189 [default7]:[2022-03-03 06:05:43,753] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 287 [default4]:[2022-03-03 06:05:43,687] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 308 [default0]:[2022-03-03 06:05:43,687] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 248 [default2]:[2022-03-03 06:05:43,755] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 346 [default0]:[2022-03-03 06:05:43,706] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 264 [default4]:[2022-03-03 06:05:43,721] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 268 [default4]:[2022-03-03 06:05:43,794] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 148 [default2]:[2022-03-03 06:05:43,831] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 338 [default2]:[2022-03-03 06:05:43,796] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 250 [default2]:[2022-03-03 06:05:43,787] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 306 [default0]:[2022-03-03 06:05:43,803] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 40 [default0]:[2022-03-03 06:05:43,921] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 272 [default3]:[2022-03-03 06:05:43,867] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 75 [default4]:[2022-03-03 06:05:43,868] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 252 [default7]:[2022-03-03 06:05:43,942] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 127 [default3]:[2022-03-03 06:05:43,930] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 307 [default4]:[2022-03-03 06:05:43,884] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 172 
[default5]:[2022-03-03 06:05:43,935] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 309 [default4]:[2022-03-03 06:05:43,970] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 300 [default0]:[2022-03-03 06:05:43,914] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 368 [default6]:[2022-03-03 06:05:44,058] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 150 [default1]:[2022-03-03 06:05:44,046] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 249 [default1]:[2022-03-03 06:05:44,038] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 185 [default3]:[2022-03-03 06:05:43,971] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 251 [default0]:[2022-03-03 06:05:44,065] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 88 [default5]:[2022-03-03 06:05:44,157] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 277 [default1]:[2022-03-03 06:05:44,071] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 345 [default4]:[2022-03-03 06:05:44,126] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 348 [default4]:[2022-03-03 06:05:44,074] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 60 [default0]:[2022-03-03 06:05:44,081] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 224 [default7]:[2022-03-03 06:05:44,120] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 39 [default4]:[2022-03-03 06:05:44,118] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 
8 ZeRO state_dicts for rank 44 [default0]:[2022-03-03 06:05:44,171] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 48 [default0]:[2022-03-03 06:05:44,186] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 360 [default5]:[2022-03-03 06:05:44,184] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 253 [default7]:[2022-03-03 06:05:44,247] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 79 [default7]:[2022-03-03 06:05:44,262] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 87 [default4]:[2022-03-03 06:05:44,180] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 196 [default0]:[2022-03-03 06:05:44,179] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 280 [default6]:[2022-03-03 06:05:44,199] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 38 [default2]:[2022-03-03 06:05:44,219] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 34 [default5]:[2022-03-03 06:05:44,242] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 205 [default5]:[2022-03-03 06:05:44,260] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 45 [default7]:[2022-03-03 06:05:44,267] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 175 [default6]:[2022-03-03 06:05:44,194] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 190 [default4]:[2022-03-03 06:05:44,185] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 52 [default6]:[2022-03-03 06:05:44,369] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 254 [default4]:[2022-03-03 06:05:44,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 228 [default6]:[2022-03-03 06:05:44,289] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 174 [default0]:[2022-03-03 06:05:44,365] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 200 [default0]:[2022-03-03 06:05:44,372] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 232 [default5]:[2022-03-03 06:05:44,377] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 341 [default1]:[2022-03-03 06:05:44,444] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 121 [default0]:[2022-03-03 06:05:44,409] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 304 [default0]:[2022-03-03 06:05:44,455] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 256 [default5]:[2022-03-03 06:05:44,473] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 53 [default4]:[2022-03-03 06:05:44,517] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 276 [default2]:[2022-03-03 06:05:44,526] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 122 [default7]:[2022-03-03 06:05:44,523] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 279 [default3]:[2022-03-03 06:05:44,554] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 187 [default2]:[2022-03-03 06:05:44,547] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 186 
[default4]:[2022-03-03 06:05:44,548] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 84 [default6]:[2022-03-03 06:05:44,473] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 350 [default2]:[2022-03-03 06:05:44,570] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 282 [default5]:[2022-03-03 06:05:44,484] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 285 [default4]:[2022-03-03 06:05:44,508] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 324 [default0]:[2022-03-03 06:05:44,497] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 32 [default7]:[2022-03-03 06:05:44,544] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 255 [default4]:[2022-03-03 06:05:44,530] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 372 [default3]:[2022-03-03 06:05:44,625] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 275 [default4]:[2022-03-03 06:05:44,578] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 292 [default3]:[2022-03-03 06:05:44,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 123 [default6]:[2022-03-03 06:05:44,608] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 342 [default2]:[2022-03-03 06:05:44,651] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 338 [default0]:[2022-03-03 06:05:44,652] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 192 [default6]:[2022-03-03 06:05:44,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for 
rank 286 [default1]:[2022-03-03 06:05:44,606] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 209 [default3]:[2022-03-03 06:05:44,664] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 291 [default2]:[2022-03-03 06:05:44,667] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 346 [default0]:[2022-03-03 06:05:44,647] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 264 [default6]:[2022-03-03 06:05:44,595] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 310 [default3]:[2022-03-03 06:05:44,583] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 283 [default4]:[2022-03-03 06:05:44,624] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 268 [default0]:[2022-03-03 06:05:44,665] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 216 [default2]:[2022-03-03 06:05:44,582] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 202 [default4]:[2022-03-03 06:05:44,620] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 156 [default0]:[2022-03-03 06:05:44,623] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 40 [default6]:[2022-03-03 06:05:44,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 206 [default5]:[2022-03-03 06:05:44,700] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 149 [default1]:[2022-03-03 06:05:44,761] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 337 [default4]:[2022-03-03 06:05:44,721] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully 
read 8 ZeRO state_dicts for rank 236 [default4]:[2022-03-03 06:05:44,728] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 148 [default5]:[2022-03-03 06:05:44,690] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 125 [default1]:[2022-03-03 06:05:44,683] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 281 [default7]:[2022-03-03 06:05:44,678] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 311 [default3]:[2022-03-03 06:05:44,686] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 35 [default4]:[2022-03-03 06:05:44,755] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 380 [default1]:[2022-03-03 06:05:44,717] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 169 [default1]:[2022-03-03 06:05:44,690] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 201 [default4]:[2022-03-03 06:05:44,711] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 204 [default0]:[2022-03-03 06:05:44,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 296 [default3]:[2022-03-03 06:05:44,761] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 211 [default3]:[2022-03-03 06:05:44,742] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 171 [default0]:[2022-03-03 06:05:44,830] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 144 [default3]:[2022-03-03 06:05:44,799] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 75 [default6]:[2022-03-03 06:05:44,820] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 86 [default3]:[2022-03-03 06:05:44,811] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 235 [default0]:[2022-03-03 06:05:44,797] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 320 [default3]:[2022-03-03 06:05:44,791] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 83 [default7]:[2022-03-03 06:05:44,776] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 287 [default0]:[2022-03-03 06:05:44,826] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 56 [default6]:[2022-03-03 06:05:44,776] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 62 [default3]:[2022-03-03 06:05:44,826] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 203 [default2]:[2022-03-03 06:05:44,808] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 170 [default2]:[2022-03-03 06:05:44,783] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 298 [default7]:[2022-03-03 06:05:44,815] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 207 [default6]:[2022-03-03 06:05:44,791] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 54 [default2]:[2022-03-03 06:05:44,904] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 274 [default2]:[2022-03-03 06:05:44,914] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 146 [default1]:[2022-03-03 06:05:44,959] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 
81 [default6]:[2022-03-03 06:05:44,871] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 126 [default5]:[2022-03-03 06:05:44,907] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 85 [default6]:[2022-03-03 06:05:44,887] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 198 [default6]:[2022-03-03 06:05:44,970] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 30 [default0]:[2022-03-03 06:05:44,963] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 224 [default3]:[2022-03-03 06:05:44,893] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 227 [default1]:[2022-03-03 06:05:44,972] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 353 [default2]:[2022-03-03 06:05:44,941] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 306 [default5]:[2022-03-03 06:05:44,936] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 381 [default5]:[2022-03-03 06:05:44,956] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 309 [default7]:[2022-03-03 06:05:44,948] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 239 [default1]:[2022-03-03 06:05:44,890] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 161 [default0]:[2022-03-03 06:05:44,984] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 8 [default1]:[2022-03-03 06:05:44,989] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 273 [default2]:[2022-03-03 06:05:45,023] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 82 [default2]:[2022-03-03 06:05:44,998] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 234 [default1]:[2022-03-03 06:05:44,979] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 185 [default4]:[2022-03-03 06:05:45,032] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 196 [default6]:[2022-03-03 06:05:44,992] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 214 [default4]:[2022-03-03 06:05:44,980] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 92 [default0]:[2022-03-03 06:05:45,058] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 88 [default7]:[2022-03-03 06:05:45,006] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 295 [default2]:[2022-03-03 06:05:45,034] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 58 [default4]:[2022-03-03 06:05:45,051] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 60 [default2]:[2022-03-03 06:05:44,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 322 [default4]:[2022-03-03 06:05:45,029] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 172 [default3]:[2022-03-03 06:05:45,067] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 259 [default3]:[2022-03-03 06:05:45,046] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 195 [default7]:[2022-03-03 06:05:45,068] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 215 [default0]:[2022-03-03 06:05:45,005] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 368 [default0]:[2022-03-03 06:05:45,052] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 160 [default6]:[2022-03-03 06:05:45,072] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 150 [default6]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 278 [default7]:[2022-03-03 06:05:45,074] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 87 [default5]:[2022-03-03 06:05:45,137] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 61 [default6]:[2022-03-03 06:05:45,135] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 94 [default5]:[2022-03-03 06:05:45,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 269 [default5]:[2022-03-03 06:05:45,077] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 37 [default5]:[2022-03-03 06:05:45,118] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 261 [default4]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 260 [default4]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 212 [default1]:[2022-03-03 06:05:45,107] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 297 [default6]:[2022-03-03 06:05:45,108] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 302 [default1]:[2022-03-03 06:05:45,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 49 
[default0]:[2022-03-03 06:05:45,183] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 48 [default4]:[2022-03-03 06:05:45,235] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 68 [default5]:[2022-03-03 06:05:45,245] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 277 [default6]:[2022-03-03 06:05:45,169] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 238 [default7]:[2022-03-03 06:05:45,175] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 127 [default0]:[2022-03-03 06:05:45,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 360 [default5]:[2022-03-03 06:05:45,211] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 253 [default3]:[2022-03-03 06:05:45,178] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 307 [default7]:[2022-03-03 06:05:45,219] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 271 [default3]:[2022-03-03 06:05:45,178] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 267 [default4]:[2022-03-03 06:05:45,191] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 220 [default4]:[2022-03-03 06:05:45,196] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 44 [default1]:[2022-03-03 06:05:45,329] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 249 [default5]:[2022-03-03 06:05:45,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 341 [default0]:[2022-03-03 06:05:45,350] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO 
state_dicts for rank 24 [default7]:[2022-03-03 06:05:45,336] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 383 [default1]:[2022-03-03 06:05:45,340] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 257 [default0]:[2022-03-03 06:05:45,308] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 200 [default5]:[2022-03-03 06:05:45,343] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 45 [default7]:[2022-03-03 06:05:45,376] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 167 [default4]:[2022-03-03 06:05:45,371] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 164 [default4]:[2022-03-03 06:05:45,388] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 12 [default3]:[2022-03-03 06:05:45,388] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 147 [default0]:[2022-03-03 06:05:45,383] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 232 [default1]:[2022-03-03 06:05:45,425] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 233 [default7]:[2022-03-03 06:05:45,450] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 279 [default3]:[2022-03-03 06:05:45,385] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 187 [default7]:[2022-03-03 06:05:45,397] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 95 [default3]:[2022-03-03 06:05:45,437] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 59 [default1]:[2022-03-03 06:05:45,416] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 57 [default5]:[2022-03-03 06:05:45,402] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 205 [default6]:[2022-03-03 06:05:45,398] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 230 [default4]:[2022-03-03 06:05:45,470] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 156 [default7]:[2022-03-03 06:05:45,451] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 175 [default7]:[2022-03-03 06:05:45,468] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 151 [default4]:[2022-03-03 06:05:45,467] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 100 [default1]:[2022-03-03 06:05:45,568] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 289 [default5]:[2022-03-03 06:05:45,568] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 285 [default1]:[2022-03-03 06:05:45,531] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 209 [default5]:[2022-03-03 06:05:45,509] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 221 [default3]:[2022-03-03 06:05:45,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 91 [default7]:[2022-03-03 06:05:45,480] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 63 [default1]:[2022-03-03 06:05:45,498] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 225 [default6]:[2022-03-03 06:05:45,494] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 326 
[default6]:[2022-03-03 06:05:45,492] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 174 [default2]:[2022-03-03 06:05:45,512] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 106 [default3]:[2022-03-03 06:05:45,493] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 107 [default6]:[2022-03-03 06:05:45,518] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 270 [default7]:[2022-03-03 06:05:45,535] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 263 [default5]:[2022-03-03 06:05:45,487] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 213 [default2]:[2022-03-03 06:05:45,516] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 50 [default3]:[2022-03-03 06:05:45,651] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 275 [default1]:[2022-03-03 06:05:45,614] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 145 [default1]:[2022-03-03 06:05:45,601] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 337 [default2]:[2022-03-03 06:05:45,591] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 194 [default3]:[2022-03-03 06:05:45,598] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 291 [default5]:[2022-03-03 06:05:45,616] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 293 [default4]:[2022-03-03 06:05:45,610] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 116 [default6]:[2022-03-03 06:05:45,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 358 [default1]:[2022-03-03 06:05:45,646] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 265 [default2]:[2022-03-03 06:05:45,617] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 378 [default0]:[2022-03-03 06:05:45,673] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 216 [default1]:[2022-03-03 06:05:45,643] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 105 [default6]:[2022-03-03 06:05:45,766] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 86 [default5]:[2022-03-03 06:05:45,696] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 237 [default5]:[2022-03-03 06:05:45,745] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 197 [default1]:[2022-03-03 06:05:45,682] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 281 [default6]:[2022-03-03 06:05:45,730] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 286 [default0]:[2022-03-03 06:05:45,744] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 64 [default5]:[2022-03-03 06:05:45,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 93 [default5]:[2022-03-03 06:05:45,760] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 365 [default2]:[2022-03-03 06:05:45,756] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 266 [default1]:[2022-03-03 06:05:45,741] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 377 [default6]:[2022-03-03 06:05:45,730] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 110 [default2]:[2022-03-03 06:05:45,733] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 258 [default6]:[2022-03-03 06:05:45,717] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 46 [default1]:[2022-03-03 06:05:45,728] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 41 [default7]:[2022-03-03 06:05:45,704] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 303 [default4]:[2022-03-03 06:05:45,756] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 20 [default6]:[2022-03-03 06:05:45,697] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 206 [default3]:[2022-03-03 06:05:45,770] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 171 [default6]:[2022-03-03 06:05:45,731] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 54 [default1]:[2022-03-03 06:05:45,836] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 81 [default1]:[2022-03-03 06:05:45,833] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 361 [default1]:[2022-03-03 06:05:45,849] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 121 [default0]:[2022-03-03 06:05:45,789] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 208 [default7]:[2022-03-03 06:05:45,803] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 199 [default3]:[2022-03-03 06:05:45,786] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 227 
[default6]:[2022-03-03 06:05:45,865] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 382 [default1]:[2022-03-03 06:05:45,813] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 193 [default2]:[2022-03-03 06:05:45,824] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 170 [default6]:[2022-03-03 06:05:45,795] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 262 [default2]:[2022-03-03 06:05:45,813] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 226 [default7]:[2022-03-03 06:05:45,858] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 47 [default3]:[2022-03-03 06:05:45,792] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 299 [default2]:[2022-03-03 06:05:45,824] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 298 [default5]:[2022-03-03 06:05:45,866] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 229 [default0]:[2022-03-03 06:05:45,831] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 296 [default3]:[2022-03-03 06:05:45,849] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 211 [default3]:[2022-03-03 06:05:45,862] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 51 [default1]:[2022-03-03 06:05:45,868] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 161 [default1]:[2022-03-03 06:05:45,889] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 9 [default1]:[2022-03-03 06:05:45,954] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 
8 ZeRO state_dicts for rank 321 [default6]:[2022-03-03 06:05:45,965] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 294 [default0]:[2022-03-03 06:05:45,909] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 320 [default2]:[2022-03-03 06:05:45,947] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 290 [default5]:[2022-03-03 06:05:45,943] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 85 [default3]:[2022-03-03 06:05:45,952] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 83 [default6]:[2022-03-03 06:05:45,961] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 198 [default6]:[2022-03-03 06:05:45,923] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 30 [default2]:[2022-03-03 06:05:45,909] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 210 [default2]:[2022-03-03 06:05:45,887] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 90 [default1]:[2022-03-03 06:05:45,901] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 113 [default7]:[2022-03-03 06:05:45,940] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 231 [default5]:[2022-03-03 06:05:45,917] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 325 [default2]:[2022-03-03 06:05:45,929] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 322 [default6]:[2022-03-03 06:05:45,956] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 38 [default2]:[2022-03-03 06:05:45,920] [INFO] [engine.py:2668:_load_zero_checkpoint] 
loading 8 zero partition checkpoints for rank 34 [default1]:[2022-03-03 06:05:45,928] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 1 [default3]:[2022-03-03 06:05:45,908] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 203 [default5]:[2022-03-03 06:05:45,941] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 109 [default1]:[2022-03-03 06:05:45,948] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 201 [default3]:[2022-03-03 06:05:45,956] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 43 [default2]:[2022-03-03 06:05:45,947] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 42 [default2]:[2022-03-03 06:05:45,938] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 162 [default6]:[2022-03-03 06:05:45,984] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 166 [default3]:[2022-03-03 06:05:45,969] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 163 [default6]:[2022-03-03 06:05:46,004] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 318 [default2]:[2022-03-03 06:05:45,985] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 122 [default3]:[2022-03-03 06:05:46,041] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 323 [default3]:[2022-03-03 06:05:46,003] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 123 [default7]:[2022-03-03 06:05:46,015] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 295 [default4]:[2022-03-03 06:05:46,036] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 92 [default0]:[2022-03-03 06:05:46,032] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 104 [default4]:[2022-03-03 06:05:45,996] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 108 [default7]:[2022-03-03 06:05:46,004] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 207 [default1]:[2022-03-03 06:05:46,082] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 49 [default0]:[2022-03-03 06:05:46,063] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 8 [default4]:[2022-03-03 06:05:46,115] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 316 [default3]:[2022-03-03 06:05:46,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 67 [default2]:[2022-03-03 06:05:46,149] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 146 [default2]:[2022-03-03 06:05:46,098] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 82 [default6]:[2022-03-03 06:05:46,071] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 126 [default0]:[2022-03-03 06:05:46,145] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 312 [default7]:[2022-03-03 06:05:46,081] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 31 [default1]:[2022-03-03 06:05:46,090] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 353 [default3]:[2022-03-03 06:05:46,091] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 379 [default3]:[2022-03-03 
06:05:46,157] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 259 [default3]:[2022-03-03 06:05:46,124] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 195 [default0]:[2022-03-03 06:05:46,106] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 152 [default5]:[2022-03-03 06:05:46,105] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 301 [default3]:[2022-03-03 06:05:46,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 363 [default6]:[2022-03-03 06:05:46,097] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 302 [default5]:[2022-03-03 06:05:46,157] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 165 [default1]:[2022-03-03 06:05:46,166] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 273 [default2]:[2022-03-03 06:05:46,180] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 274 [default5]:[2022-03-03 06:05:46,181] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 149 [default2]:[2022-03-03 06:05:46,265] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 314 [default6]:[2022-03-03 06:05:46,231] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 214 [default6]:[2022-03-03 06:05:46,210] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 94 [default0]:[2022-03-03 06:05:46,235] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 56 [default7]:[2022-03-03 06:05:46,205] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 327 
[default3]:[2022-03-03 06:05:46,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 267 [default2]:[2022-03-03 06:05:46,179] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 218 [default4]:[2022-03-03 06:05:46,231] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 260 [default5]:[2022-03-03 06:05:46,236] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 261 [default4]:[2022-03-03 06:05:46,232] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 132 [default1]:[2022-03-03 06:05:46,204] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 297 [default6]:[2022-03-03 06:05:46,265] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 374 [default0]:[2022-03-03 06:05:46,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 160 [default4]:[2022-03-03 06:05:46,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 68 [default6]:[2022-03-03 06:05:46,280] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 278 [default5]:[2022-03-03 06:05:46,300] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 125 [default0]:[2022-03-03 06:05:46,345] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 24 [default2]:[2022-03-03 06:05:46,349] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 114 [default5]:[2022-03-03 06:05:46,314] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 61 [default2]:[2022-03-03 06:05:46,301] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for 
rank 58 [default5]:[2022-03-03 06:05:46,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 269 [default4]:[2022-03-03 06:05:46,319] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 380 [default6]:[2022-03-03 06:05:46,351] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 366 [default7]:[2022-03-03 06:05:46,350] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 167 [default2]:[2022-03-03 06:05:46,360] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 50 [default0]:[2022-03-03 06:05:46,446] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 144 [default3]:[2022-03-03 06:05:46,390] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 147 [default4]:[2022-03-03 06:05:46,407] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 236 [default4]:[2022-03-03 06:05:46,389] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 100 [default1]:[2022-03-03 06:05:46,372] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 25 [default7]:[2022-03-03 06:05:46,448] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 119 [default6]:[2022-03-03 06:05:46,385] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 118 [default1]:[2022-03-03 06:05:46,381] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 89 [default2]:[2022-03-03 06:05:46,444] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 362 [default2]:[2022-03-03 06:05:46,423] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 
ZeRO state_dicts for rank 354 [default4]:[2022-03-03 06:05:46,409] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 220 [default1]:[2022-03-03 06:05:46,450] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 257 [default7]:[2022-03-03 06:05:46,413] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 111 [default7]:[2022-03-03 06:05:46,446] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 367 [default4]:[2022-03-03 06:05:46,403] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 212 [default7]:[2022-03-03 06:05:46,391] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 215 [default7]:[2022-03-03 06:05:46,429] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 55 [default7]:[2022-03-03 06:05:46,482] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 151 [default1]:[2022-03-03 06:05:46,530] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 289 [default0]:[2022-03-03 06:05:46,495] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 0 [default7]:[2022-03-03 06:05:46,549] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 103 [default1]:[2022-03-03 06:05:46,531] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 217 [default5]:[2022-03-03 06:05:46,558] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 381 [default4]:[2022-03-03 06:05:46,550] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 4 [default0]:[2022-03-03 06:05:46,573] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 376 [default6]:[2022-03-03 06:05:46,533] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 222 [default4]:[2022-03-03 06:05:46,527] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 164 [default4]:[2022-03-03 06:05:46,527] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 12 [default2]:[2022-03-03 06:05:46,640] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 66 [default1]:[2022-03-03 06:05:46,597] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 145 [default0]:[2022-03-03 06:05:46,626] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 96 [default6]:[2022-03-03 06:05:46,636] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 102 [default4]:[2022-03-03 06:05:46,582] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 116 [default3]:[2022-03-03 06:05:46,672] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 91 [default6]:[2022-03-03 06:05:46,670] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 62 [default6]:[2022-03-03 06:05:46,619] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 326 [default5]:[2022-03-03 06:05:46,599] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 357 [default7]:[2022-03-03 06:05:46,618] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 359 [default0]:[2022-03-03 06:05:46,673] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 136 
[default3]:[2022-03-03 06:05:46,591] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 355 [default3]:[2022-03-03 06:05:46,602] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 107 [default6]:[2022-03-03 06:05:46,677] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 46 [default0]:[2022-03-03 06:05:46,663] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 16 [default5]:[2022-03-03 06:05:46,623] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 213 [default1]:[2022-03-03 06:05:46,685] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 65 [default6]:[2022-03-03 06:05:46,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 70 [default3]:[2022-03-03 06:05:46,714] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 315 [default2]:[2022-03-03 06:05:46,764] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 290 [default3]:[2022-03-03 06:05:46,765] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 27 [default0]:[2022-03-03 06:05:46,673] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 112 [default5]:[2022-03-03 06:05:46,703] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 221 [default7]:[2022-03-03 06:05:46,713] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 223 [default1]:[2022-03-03 06:05:46,755] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 57 [default5]:[2022-03-03 06:05:46,686] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 365 [default7]:[2022-03-03 06:05:46,683] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 271 [default5]:[2022-03-03 06:05:46,725] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 117 [default5]:[2022-03-03 06:05:46,761] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 29 [default3]:[2022-03-03 06:05:46,715] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 299 [default7]:[2022-03-03 06:05:46,768] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 303 [default4]:[2022-03-03 06:05:46,689] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 20 [default3]:[2022-03-03 06:05:46,736] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 51 [default6]:[2022-03-03 06:05:46,702] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 6 [default7]:[2022-03-03 06:05:46,863] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 319 [default2]:[2022-03-03 06:05:46,786] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 26 [default5]:[2022-03-03 06:05:46,805] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 69 [default2]:[2022-03-03 06:05:46,862] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 210 [default7]:[2022-03-03 06:05:46,846] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 199 [default7]:[2022-03-03 06:05:46,815] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 95 [default1]:[2022-03-03 06:05:46,857] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 
zero partition checkpoints for rank 225 [default2]:[2022-03-03 06:05:46,806] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 258 [default1]:[2022-03-03 06:05:46,780] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 41 [default1]:[2022-03-03 06:05:46,796] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 153 [default3]:[2022-03-03 06:05:46,806] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 219 [default1]:[2022-03-03 06:05:46,845] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 9 [default2]:[2022-03-03 06:05:46,873] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 234 [default1]:[2022-03-03 06:05:46,896] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 361 [default0]:[2022-03-03 06:05:46,940] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 240 [default4]:[2022-03-03 06:05:46,909] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 244 [default5]:[2022-03-03 06:05:46,969] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 293 [default7]:[2022-03-03 06:05:46,963] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 63 [default3]:[2022-03-03 06:05:46,948] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 115 [default6]:[2022-03-03 06:05:46,880] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 358 [default1]:[2022-03-03 06:05:46,959] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 1 [default2]:[2022-03-03 06:05:46,897] [INFO] [engine.py:2668:_load_zero_checkpoint] 
loading 8 zero partition checkpoints for rank 106 [default3]:[2022-03-03 06:05:46,960] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 43 [default7]:[2022-03-03 06:05:46,921] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 47 [default5]:[2022-03-03 06:05:46,885] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 133 [default5]:[2022-03-03 06:05:46,979] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 301 [default3]:[2022-03-03 06:05:46,975] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 3 [default7]:[2022-03-03 06:05:46,910] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 263 [default3]:[2022-03-03 06:05:46,990] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 163 [default7]:[2022-03-03 06:05:46,968] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 71 [default1]:[2022-03-03 06:05:47,044] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 313 [default5]:[2022-03-03 06:05:46,983] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 101 [default1]:[2022-03-03 06:05:47,057] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 233 [default6]:[2022-03-03 06:05:47,010] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 238 [default6]:[2022-03-03 06:05:46,999] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 294 [default4]:[2022-03-03 06:05:46,993] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 28 [default1]:[2022-03-03 06:05:46,985] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 241 [default3]:[2022-03-03 06:05:47,049] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 59 [default2]:[2022-03-03 06:05:46,980] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 90 [default5]:[2022-03-03 06:05:46,983] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 5 [default1]:[2022-03-03 06:05:47,062] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 265 [default0]:[2022-03-03 06:05:47,032] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 104 [default2]:[2022-03-03 06:05:47,057] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 226 [default6]:[2022-03-03 06:05:46,990] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 262 [default6]:[2022-03-03 06:05:47,053] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 110 [default1]:[2022-03-03 06:05:46,983] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 105 [default2]:[2022-03-03 06:05:47,055] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 42 [default2]:[2022-03-03 06:05:47,049] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 370 [default5]:[2022-03-03 06:05:47,058] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 13 [default2]:[2022-03-03 06:05:47,065] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 10 [default0]:[2022-03-03 06:05:47,166] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 312 [default1]:[2022-03-03 
06:05:47,120] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 321 [default2]:[2022-03-03 06:05:47,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 194 [default5]:[2022-03-03 06:05:47,170] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 93 [default1]:[2022-03-03 06:05:47,124] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 113 [default5]:[2022-03-03 06:05:47,089] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 325 [default2]:[2022-03-03 06:05:47,162] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 266 [default2]:[2022-03-03 06:05:47,096] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 218 [default6]:[2022-03-03 06:05:47,145] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 230 [default6]:[2022-03-03 06:05:47,139] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 270 [default0]:[2022-03-03 06:05:47,111] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 152 [default4]:[2022-03-03 06:05:47,126] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 132 [default3]:[2022-03-03 06:05:47,122] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 363 [default7]:[2022-03-03 06:05:47,129] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 375 [default7]:[2022-03-03 06:05:47,170] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 239 [default6]:[2022-03-03 06:05:47,178] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 318 [default3]:[2022-03-03 
06:05:47,226] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 235 [default5]:[2022-03-03 06:05:47,213] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 197 [default5]:[2022-03-03 06:05:47,263] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 245 [default6]:[2022-03-03 06:05:47,268] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 246 [default7]:[2022-03-03 06:05:47,175] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 31 [default0]:[2022-03-03 06:05:47,191] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 208 [default2]:[2022-03-03 06:05:47,195] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 2 [default0]:[2022-03-03 06:05:47,179] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 64 [default7]:[2022-03-03 06:05:47,268] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 327 [default7]:[2022-03-03 06:05:47,196] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 383 [default1]:[2022-03-03 06:05:47,216] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 193 [default5]:[2022-03-03 06:05:47,210] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 157 [default5]:[2022-03-03 06:05:47,208] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 229 [default3]:[2022-03-03 06:05:47,270] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 371 [default1]:[2022-03-03 06:05:47,201] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 369 
[default5]:[2022-03-03 06:05:47,233] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 165 [default2]:[2022-03-03 06:05:47,195] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 162 [default6]:[2022-03-03 06:05:47,242] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 166 [default7]:[2022-03-03 06:05:47,275] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 55 [default7]:[2022-03-03 06:05:47,277] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 15 [default3]:[2022-03-03 06:05:47,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 67 [default5]:[2022-03-03 06:05:47,358] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 317 [default3]:[2022-03-03 06:05:47,277] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 323 [default2]:[2022-03-03 06:05:47,355] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 242 [default2]:[2022-03-03 06:05:47,282] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 114 [default2]:[2022-03-03 06:05:47,288] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 378 [default1]:[2022-03-03 06:05:47,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 377 [default5]:[2022-03-03 06:05:47,376] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 373 [default7]:[2022-03-03 06:05:47,345] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 7 [default3]:[2022-03-03 06:05:47,302] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO 
state_dicts for rank 11 [default6]:[2022-03-03 06:05:47,331] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 14 [default5]:[2022-03-03 06:05:47,382] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 237 [default2]:[2022-03-03 06:05:47,382] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 314 [default3]:[2022-03-03 06:05:47,441] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 99 [default1]:[2022-03-03 06:05:47,455] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 25 [default2]:[2022-03-03 06:05:47,440] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 98 [default7]:[2022-03-03 06:05:47,468] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 231 [default3]:[2022-03-03 06:05:47,442] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 139 [default0]:[2022-03-03 06:05:47,464] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 128 [default1]:[2022-03-03 06:05:47,489] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 89 [default4]:[2022-03-03 06:05:47,542] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 140 [default5]:[2022-03-03 06:05:47,481] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 109 [default3]:[2022-03-03 06:05:47,495] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 131 [default6]:[2022-03-03 06:05:47,486] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 374 [default1]:[2022-03-03 06:05:47,652] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 97 [default6]:[2022-03-03 06:05:47,663] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 118 [default2]:[2022-03-03 06:05:47,622] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 362 [default3]:[2022-03-03 06:05:47,642] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 379 [default2]:[2022-03-03 06:05:47,630] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 354 [default6]:[2022-03-03 06:05:47,617] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 382 [default5]:[2022-03-03 06:05:47,632] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 117 [default4]:[2022-03-03 06:05:47,594] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 108 [default7]:[2022-03-03 06:05:47,645] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 111 [default1]:[2022-03-03 06:05:47,596] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 129 [default0]:[2022-03-03 06:05:47,624] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 16 [default3]:[2022-03-03 06:05:47,670] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 243 [default7]:[2022-03-03 06:05:47,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 135 [default4]:[2022-03-03 06:05:47,755] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 316 [default7]:[2022-03-03 06:05:47,769] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 119 [default7]:[2022-03-03 06:05:47,706] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 367 [default6]:[2022-03-03 06:05:47,743] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 22 [default2]:[2022-03-03 06:05:47,683] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 130 [default7]:[2022-03-03 06:05:47,691] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 143 [default6]:[2022-03-03 06:05:47,686] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 366 [default6]:[2022-03-03 06:05:47,692] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 142 [default0]:[2022-03-03 06:05:47,801] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 112 [default5]:[2022-03-03 06:05:47,842] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 141 [default3]:[2022-03-03 06:05:47,856] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 115 [default3]:[2022-03-03 06:05:47,824] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 355 [default5]:[2022-03-03 06:05:47,876] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 21 [default2]:[2022-03-03 06:05:47,868] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 154 [default3]:[2022-03-03 06:05:47,860] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 155 [default6]:[2022-03-03 06:05:47,830] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 222 [default6]:[2022-03-03 06:05:47,850] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 134 
[default1]:[2022-03-03 06:05:47,925] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 65 [default3]:[2022-03-03 06:05:47,965] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 315 [default1]:[2022-03-03 06:05:47,943] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 313 [default5]:[2022-03-03 06:05:47,951] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 101 [default3]:[2022-03-03 06:05:47,913] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 27 [default2]:[2022-03-03 06:05:47,912] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 138 [default0]:[2022-03-03 06:05:47,929] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 0 [default0]: checkpoint version 3.0 [default0]:[2022-03-03 06:05:47,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 376 [default2]:[2022-03-03 06:05:47,892] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 18 [default3]:[2022-03-03 06:05:47,970] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 219 [default1]:[2022-03-03 06:05:47,916] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 153 [default6]:[2022-03-03 06:05:47,958] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 6 [default2]:[2022-03-03 06:05:47,971] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 66 [default7]:[2022-03-03 06:05:48,009] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 247 [default7]:[2022-03-03 06:05:48,051] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero 
partition checkpoints for rank 223 [default7]:[2022-03-03 06:05:48,055] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 359 [default5]:[2022-03-03 06:05:48,030] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 357 [default0]:[2022-03-03 06:05:48,003] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 136 [default5]:[2022-03-03 06:05:48,015] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 133 [default3]:[2022-03-03 06:05:48,078] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 3 [default2]:[2022-03-03 06:05:48,143] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 26 [default7]:[2022-03-03 06:05:48,111] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 103 [default6]:[2022-03-03 06:05:48,147] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 102 [default1]:[2022-03-03 06:05:48,130] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 217 [default1]:[2022-03-03 06:05:48,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 137 [default5]:[2022-03-03 06:05:48,180] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 29 [default7]:[2022-03-03 06:05:48,168] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 159 [default7]:[2022-03-03 06:05:48,138] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 23 [default6]:[2022-03-03 06:05:48,131] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 158 [default0]:[2022-03-03 06:05:48,237] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 
8 zero partition checkpoints for rank 96 [default7]:[2022-03-03 06:05:48,183] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 319 [default5]:[2022-03-03 06:05:48,272] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 69 [default4]:[2022-03-03 06:05:48,238] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 244 [default3]:[2022-03-03 06:05:48,226] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 19 [default1]:[2022-03-03 06:05:48,206] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 369 [default2]:[2022-03-03 06:05:48,243] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 10 [default5]:[2022-03-03 06:05:48,318] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 317 [default0]:[2022-03-03 06:05:48,367] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 240 [default1]:[2022-03-03 06:05:48,311] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 17 [default7]:[2022-03-03 06:05:48,447] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 71 [default6]:[2022-03-03 06:05:48,413] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 70 [default4]:[2022-03-03 06:05:48,452] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 28 [default6]:[2022-03-03 06:05:48,470] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 246 [default2]:[2022-03-03 06:05:48,410] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 98 [default1]:[2022-03-03 06:05:48,429] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 
zero partition checkpoints for rank 241 [default5]:[2022-03-03 06:05:48,414] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 157 [default4]:[2022-03-03 06:05:48,553] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 4 [default3]:[2022-03-03 06:05:48,522] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 139 [default0]:[2022-03-03 06:05:48,510] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 128 [default2]:[2022-03-03 06:05:48,536] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 370 [default3]:[2022-03-03 06:05:48,560] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 371 [default7]:[2022-03-03 06:05:48,510] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 375 [default3]:[2022-03-03 06:05:48,527] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 11 [default5]:[2022-03-03 06:05:48,529] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 13 [default3]:[2022-03-03 06:05:48,656] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 99 [default2]:[2022-03-03 06:05:48,611] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 242 [default5]:[2022-03-03 06:05:48,597] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 245 [default7]:[2022-03-03 06:05:48,673] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 135 [default7]:[2022-03-03 06:05:48,599] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 7 [default5]:[2022-03-03 06:05:48,712] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero 
partition checkpoints for rank 5 [default3]:[2022-03-03 06:05:48,733] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 131 [default5]:[2022-03-03 06:05:48,739] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 373 [default7]:[2022-03-03 06:05:48,764] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 15 [default2]:[2022-03-03 06:05:48,788] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 2 [default1]:[2022-03-03 06:05:48,849] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 129 [default7]:[2022-03-03 06:05:48,816] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 143 [default3]:[2022-03-03 06:05:48,829] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 243 [default6]:[2022-03-03 06:05:48,827] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 142 [default6]:[2022-03-03 06:05:48,813] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 14 [default1]:[2022-03-03 06:05:48,909] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 97 [default2]:[2022-03-03 06:05:48,919] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 138 [default5]:[2022-03-03 06:05:49,038] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 141 [default7]:[2022-03-03 06:05:49,015] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 247 [default1]:[2022-03-03 06:05:49,054] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 137 [default4]:[2022-03-03 06:05:48,997] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 140 [default3]:[2022-03-03 06:05:49,069] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 155 [default6]:[2022-03-03 06:05:49,079] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 134 [default2]:[2022-03-03 06:05:49,145] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 18 [default2]:[2022-03-03 06:05:49,088] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 154 [default2]:[2022-03-03 06:05:49,128] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 130 [default3]:[2022-03-03 06:05:49,450] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 19 [default5]:[2022-03-03 06:05:49,463] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 21 [default7]:[2022-03-03 06:05:49,399] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 159 [default6]:[2022-03-03 06:05:49,433] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 158 [default6]:[2022-03-03 06:05:49,541] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 22 [default0]: successfully loaded checkpoint from /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints at iteration 50 [default0]:estimated model parameters: 191.162474496 [default0]:estimated model parameters without embeddings: 148.003086336 [default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2022-03-03 06:05:49 [default0]:> building train, validation, and test datasets ... [default0]: > datasets target sizes (minimum size): [default0]: train: 220000000 [default0]: validation: 2641920 [default0]: test: 20480 [default0]:> building train, validation, and test datasets for GPT ... 
[default0]: > building dataset index ... [default7]:time (ms) | load-checkpoint: 25471.88 [default1]:[2022-03-03 06:05:49,617] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 17 [default7]:[2022-03-03 06:05:49,628] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 23 [default0]:/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/utils.py:280: UserWarning: Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings [default0]: warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings") [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.066723 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 1211127) total of 1211127 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from 
/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.052 seconds [default0]: total number of samples: 19333818 [default0]: total number of epochs: 41 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.013180 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 2104966) total of 2104966 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.082 seconds [default0]: total number of samples: 4602461 [default0]: total number of epochs: 22 [default0]: > building dataset 
index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... [default0]: > finished creating indexed dataset in 0.015850 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 13965889) total of 13965889 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.179 seconds [default0]: total number of samples: 35728792 [default0]: total number of epochs: 4 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002722 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 2626391) total of 2626391 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.084 seconds [default0]: total number of samples: 28139393 [default0]: total number of epochs: 28 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.008013 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 746147) total of 746147 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.124 seconds [default0]: total number of samples: 670404 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.023520 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 1659380) total of 1659380 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.098 seconds [default0]: total number of samples: 27952020 [default0]: total number of epochs: 56 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002128 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 27961608) total of 27961608 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.159 seconds [default0]: total number of samples: 14638800 [default0]: total number of epochs: 42 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.019843 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 36350552) total of 36350552 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.183 seconds [default0]: total number of samples: 27308815 [default0]: total number of epochs: 46 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.013062 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 692454) total of 692454 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.161 seconds [default0]: total number of samples: 6887421 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.028485 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 23027980) total of 23027980 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.135 seconds [default0]: total number of samples: 10304343 [default0]: total number of epochs: 25 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.022085 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 9098495) total of 9098495 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.231 seconds [default0]: total number of samples: 28924755 [default0]: total number of epochs: 10 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.011283 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 4114797) total of 4114797 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.084 seconds [default0]: total number of samples: 29929866 [default0]: total number of epochs: 11 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002166 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 142095) total of 142095 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.024 seconds [default0]: total number of samples: 127855 [default0]: total number of epochs: 18 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios:
[default0]: dataset 0, input: 0.0870676, achieved: 0.0870676
[default0]: dataset 1, input: 0.0207314, achieved: 0.0207314
[default0]: dataset 2, input: 0.1247, achieved: 0.1247
[default0]: dataset 3, input: 0.124182, achieved: 0.124182
[default0]: dataset 4, input: 0.0029046, achieved: 0.0029046
[default0]: dataset 5, input: 0.1247, achieved: 0.1247
[default0]: dataset 6, input: 0.0659275, achieved: 0.0659275
[default0]: dataset 7, input: 0.120941, achieved: 0.120941
[default0]: dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]: dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]: dataset 10, input: 0.127064, achieved: 0.127064
[default0]: dataset 11, input: 0.1247, achieved: 0.1247
[default0]: dataset 12, input: 0.000554406, achieved: 0.000554405
[default0]:> elapsed time for building blendable dataset indices: 4.04 (sec)
[default0]: > building dataset index ...
[default0]: reading sizes...
[default0]: reading pointers...
[default0]: reading document index...
[default0]: creating numpy buffer of mmap...
[default0]: creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002967 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [1211127, 1274938) total of 63811 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.018 seconds [default0]: total number of samples: 241146 [default0]: total number of epochs: 18 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.076098 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [2104966, 2215871) total of 110905 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.010 seconds [default0]: total number of samples: 55872 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002478 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [13965889, 14701711) total of 735822 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.052 seconds [default0]: total number of samples: 1880535 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007496 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [2626391, 2764767) total of 138376 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.030 seconds [default0]: total number of samples: 480297 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002187 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [746147, 785459) total of 39312 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 8487 [default0]: total number of epochs: 8 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002456 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [1659380, 1746807) total of 87427 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.031 seconds [default0]: total number of samples: 907157 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.032000 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [27961608, 29434823) total of 1473215 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.102 seconds [default0]: total number of samples: 186675 [default0]: total number of epochs: 12 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007831 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [36350552, 38265755) total of 1915203 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.120 seconds [default0]: total number of samples: 333733 [default0]: total number of epochs: 13 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001766 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [692454, 728937) total of 36483 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.010 seconds [default0]: total number of samples: 98264 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.038396 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [23027980, 24241256) total of 1213276 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.071 seconds [default0]: total number of samples: 129080 [default0]: total number of epochs: 6 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007623 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [9098495, 9577868) total of 479373 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.027 seconds [default0]: total number of samples: 469042 [default0]: total number of epochs: 3 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.008754 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [4114797, 4331593) total of 216796 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.029 seconds [default0]: total number of samples: 398209 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.003000 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [142095, 149581) total of 7486 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 1544 [default0]: total number of epochs: 6 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870675 [default0]: dataset 1, input: 0.0207314, achieved: 0.0207315 [default0]: dataset 2, input: 0.1247, achieved: 0.1247 [default0]: dataset 3, input: 0.124182, achieved: 0.124182 [default0]: dataset 4, input: 0.0029046, achieved: 0.00290461 [default0]: dataset 5, input: 0.1247, achieved: 0.1247 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659274 [default0]: dataset 7, input: 0.120941, achieved: 0.120941 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310665 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454631 [default0]: dataset 10, input: 0.127064, achieved: 0.127064 [default0]: dataset 11, input: 0.1247, achieved: 0.1247 [default0]: dataset 12, input: 0.000554406, achieved: 0.000554525 [default0]:> elapsed time for building blendable dataset indices: 0.09 (sec) [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002837 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: test: [default0]: document indices in [1274938, 1276214) total of 1276 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.038 seconds [default0]: total number of samples: 202915 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001934 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: test: [default0]: document indices in [2215871, 2218089) total of 2218 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 459 [default0]: total number of epochs: 13 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001928 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: test: [default0]: document indices in [14701711, 14716427) total of 14716 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 37487 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.021217 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: test: [default0]: document indices in [2764767, 2767535) total of 2768 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 9926 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002356 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: test: [default0]: document indices in [785459, 786245) total of 786 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 79 [default0]: total number of epochs: 4 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002487 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: test: [default0]: document indices in [1746807, 1748556) total of 1749 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 34096 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002759 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: test: [default0]: document indices in [29434823, 29464287) total of 29464 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 1645 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.010488 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: test: [default0]: document indices in [38265755, 38304059) total of 38304 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.006 seconds [default0]: total number of samples: 2778 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002211 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: test: [default0]: document indices in [728937, 729667) total of 730 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 716 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001865 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: test: [default0]: document indices in [24241256, 24265522) total of 24266 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 1312 [default0]: total number of epochs: 3 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.008090 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: test: [default0]: document indices in [9577868, 9587455) total of 9587 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 3324 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002499 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: test: [default0]: document indices in [4331593, 4335929) total of 4336 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.005 seconds [default0]: total number of samples: 3964 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.004362 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: test: [default0]: document indices in [149581, 149731) total of 150 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.001 seconds [default0]: total number of samples: 15 [default0]: total number of epochs: 2 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870664 [default0]: dataset 1, input: 0.0207314, achieved: 0.020733 [default0]: dataset 2, input: 0.1247, achieved: 0.124699 [default0]: dataset 3, input: 0.124182, achieved: 0.12418 [default0]: dataset 4, input: 0.0029046, achieved: 0.0029059 [default0]: dataset 5, input: 0.1247, achieved: 0.124699 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659284 [default0]: dataset 7, input: 0.120941, achieved: 0.12094 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310676 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454632 [default0]: dataset 10, input: 0.127064, achieved: 0.127063 [default0]: dataset 11, input: 0.1247, achieved: 0.124699 [default0]: dataset 12, input: 0.000554406, achieved: 0.000555736 [default0]:> elapsed time for building blendable dataset indices: 0.01 (sec) [default0]:> finished creating GPT datasets ... [default1]:[001-002] 177.6021B / 177.6021B [default2]:[002-002] 177.6021B / 177.6021B [default3]:[003-002] 177.6021B / 177.6021B [default0]:[000-009] 177.6021B / 177.6021B [default0]:[000-003] 177.6021B / 177.6021B [default0]:[000-011] 191.1639B / 148.0045B [default0]:[000-010] 177.6021B / 177.6021B [default3]:[003-009] 177.6021B / 177.6021B [default2]:[002-000] 191.1625B / 148.0031B [default0]:[000-002] 177.6021B / 177.6021B [default3]:[003-003] 177.6021B / 177.6021B [default1]:[001-003] 177.6021B / 177.6021B [default2]:[002-009] 177.6021B / 177.6021B [default1]:[001-009] 177.6021B / 177.6021B [default1]:[001-011] 191.1639B / 148.0045B [default0]:[000-001] 177.6021B / 177.6021B [default3]:[003-001] 177.6021B / 177.6021B [default2]:[002-010] 177.6021B / 177.6021B [default1]:[001-000] 191.1625B / 148.0031B [default3]:[003-010] 177.6021B / 177.6021B [default1]:[001-010] 177.6021B / 177.6021B [default0]:[000-007] 177.6021B / 177.6021B [default1]:[001-007] 177.6021B / 177.6021B [default3]:[003-007] 177.6021B / 177.6021B
[default0]:[after dataloaders are built] datetime: 2022-03-03 06:06:03 [default0]:done with setup ... [default0]:training ... [default0]:Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings: [default0]:[000-000] 191.1625B / 148.0031B [default0]:[before the start of training step] datetime: 2022-03-03 06:06:03 [default3]:[003-008] 177.6021B / 177.6021B [default0]:[000-006] 177.6021B / 177.6021B [default2]:[002-004] 177.6021B / 177.6021B [default2]:[002-007] 177.6021B / 177.6021B [default2]:[002-003] 177.6021B / 177.6021B [default2]:[002-001] 177.6021B / 177.6021B [default1]:[001-001] 177.6021B / 177.6021B [default3]:[003-004] 177.6021B / 177.6021B [default2]:[002-011] 191.1639B / 148.0045B [default1]:[001-008] 177.6021B / 177.6021B [default3]:[003-011] 191.1639B / 148.0045B [default0]:[000-008] 177.6021B / 177.6021B [default2]:[002-008] 177.6021B / 177.6021B [default0]:[000-004] 177.6021B / 177.6021B [default1]:[001-006] 177.6021B / 177.6021B [default2]:[002-006] 177.6021B / 177.6021B [default2]:[002-005] 177.6021B / 177.6021B [default1]:[001-005] 177.6021B / 177.6021B [default3]:[003-005] 177.6021B / 177.6021B [default1]:[001-004] 177.6021B / 177.6021B [default3]:[003-006] 177.6021B / 177.6021B [default7]:time (ms) | model-and-optimizer-setup: 33538.23 | train/valid/test-data-iterators-setup: 13018.50 [default3]:[003-000] 191.1625B / 148.0031B [default0]:[000-005] 177.6021B / 177.6021B [default0]:[2022-03-03 06:06:03,406] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information [default0]:[2022-03-03 06:06:03,406] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False [default0]:[2022-03-03 06:06:03,407] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 70 total layers [default0]:[2022-03-03 06:06:03,407] [INFO] [checkpointing.py:554:forward] ----Synchronization False [default0]:[2022-03-03 06:06:03,407] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False
[default3]:[Rank 67] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 99] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 35] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 291] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 323] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 227] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 131] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 355] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default3]:[Rank 259] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 195] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 3] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default3]:[Rank 163] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default7]: iteration 51/ 128728 | consumed samples: 
816 | consumed tokens: 1671168 | elapsed time per iteration (s): 39.93 | learning rate: 2.674E-07 | global batch size: 16 | lm loss: 1.196962E+01 | grad norm: 2.520 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.401 | TFLOPs: 3.07 | [default1]:[Rank 65] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 288] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 96] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 97] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 320] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 353] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default2]:[Rank 2] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default0]:[Rank 64] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 290] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 289] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 322] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | 
max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 32] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 224] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 352] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default0]:[Rank 0] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default1]:[Rank 225] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 98] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 192] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 321] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 226] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 1] (after 51 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default2]:[Rank 130] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 354] (after 51 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 
| max reserved: 48348.0 [default1]:[Rank 33] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 128] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 193] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 194] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 257] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 161] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 162] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 129] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 34] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 256] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 258] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 160] (after 51 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 66] (after 51 
iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default7]: iteration 52/ 128728 | consumed samples: 832 | consumed tokens: 1703936 | elapsed time per iteration (s): 14.92 | learning rate: 2.726E-07 | global batch size: 16 | lm loss: 1.192162E+01 | grad norm: 2.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.072 | TFLOPs: 8.21 | [default7]: iteration 53/ 128728 | consumed samples: 848 | consumed tokens: 1736704 | elapsed time per iteration (s): 15.24 | learning rate: 2.779E-07 | global batch size: 16 | lm loss: 1.202632E+01 | grad norm: 1.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 54/ 128728 | consumed samples: 864 | consumed tokens: 1769472 | elapsed time per iteration (s): 15.19 | learning rate: 2.831E-07 | global batch size: 16 | lm loss: 1.187102E+01 | grad norm: 2.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 55/ 128728 | consumed samples: 880 | consumed tokens: 1802240 | elapsed time per iteration (s): 15.20 | learning rate: 2.884E-07 | global batch size: 16 | lm loss: 1.191143E+01 | grad norm: 2.075 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 56/ 128728 | consumed samples: 896 | consumed tokens: 1835008 | elapsed time per iteration (s): 15.21 | learning rate: 2.936E-07 | global batch size: 16 | lm loss: 1.189511E+01 | grad norm: 1.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 57/ 128728 | consumed samples: 912 | consumed tokens: 1867776 | elapsed time per iteration (s): 15.18 | learning rate: 
2.988E-07 | global batch size: 16 | lm loss: 1.175074E+01 | grad norm: 2.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 58/ 128728 | consumed samples: 928 | consumed tokens: 1900544 | elapsed time per iteration (s): 15.18 | learning rate: 3.041E-07 | global batch size: 16 | lm loss: 1.181468E+01 | grad norm: 1.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 59/ 128728 | consumed samples: 944 | consumed tokens: 1933312 | elapsed time per iteration (s): 15.21 | learning rate: 3.093E-07 | global batch size: 16 | lm loss: 1.167815E+01 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 60/ 128728 | consumed samples: 960 | consumed tokens: 1966080 | elapsed time per iteration (s): 15.20 | learning rate: 3.146E-07 | global batch size: 16 | lm loss: 1.176816E+01 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 61/ 128728 | consumed samples: 976 | consumed tokens: 1998848 | elapsed time per iteration (s): 15.20 | learning rate: 3.198E-07 | global batch size: 16 | lm loss: 1.160849E+01 | grad norm: 1.616 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 62/ 128728 | consumed samples: 992 | consumed tokens: 2031616 | elapsed time per iteration (s): 15.19 | learning rate: 3.251E-07 | global batch size: 16 | lm loss: 1.165278E+01 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 63/ 128728 | consumed samples: 1008 | consumed tokens: 
2064384 | elapsed time per iteration (s): 15.23 | learning rate: 3.303E-07 | global batch size: 16 | lm loss: 1.162152E+01 | grad norm: 1.387 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 64/ 128728 | consumed samples: 1024 | consumed tokens: 2097152 | elapsed time per iteration (s): 15.20 | learning rate: 3.355E-07 | global batch size: 16 | lm loss: 1.163912E+01 | grad norm: 1.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 65/ 128728 | consumed samples: 1040 | consumed tokens: 2129920 | elapsed time per iteration (s): 15.19 | learning rate: 3.408E-07 | global batch size: 16 | lm loss: 1.152941E+01 | grad norm: 1.425 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 66/ 128728 | consumed samples: 1056 | consumed tokens: 2162688 | elapsed time per iteration (s): 15.22 | learning rate: 3.460E-07 | global batch size: 16 | lm loss: 1.144800E+01 | grad norm: 1.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 67/ 128728 | consumed samples: 1072 | consumed tokens: 2195456 | elapsed time per iteration (s): 15.18 | learning rate: 3.513E-07 | global batch size: 16 | lm loss: 1.142246E+01 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 68/ 128728 | consumed samples: 1088 | consumed tokens: 2228224 | elapsed time per iteration (s): 15.22 | learning rate: 3.565E-07 | global batch size: 16 | lm loss: 1.147447E+01 | grad norm: 1.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | 
[default7]: iteration 69/ 128728 | consumed samples: 1104 | consumed tokens: 2260992 | elapsed time per iteration (s): 15.23 | learning rate: 3.618E-07 | global batch size: 16 | lm loss: 1.132389E+01 | grad norm: 1.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 70/ 128728 | consumed samples: 1120 | consumed tokens: 2293760 | elapsed time per iteration (s): 15.19 | learning rate: 3.670E-07 | global batch size: 16 | lm loss: 1.135389E+01 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 71/ 128728 | consumed samples: 1136 | consumed tokens: 2326528 | elapsed time per iteration (s): 15.28 | learning rate: 3.722E-07 | global batch size: 16 | lm loss: 1.143639E+01 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 72/ 128728 | consumed samples: 1152 | consumed tokens: 2359296 | elapsed time per iteration (s): 15.19 | learning rate: 3.775E-07 | global batch size: 16 | lm loss: 1.144752E+01 | grad norm: 1.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 73/ 128728 | consumed samples: 1168 | consumed tokens: 2392064 | elapsed time per iteration (s): 15.17 | learning rate: 3.827E-07 | global batch size: 16 | lm loss: 1.136817E+01 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 74/ 128728 | consumed samples: 1184 | consumed tokens: 2424832 | elapsed time per iteration (s): 15.22 | learning rate: 3.880E-07 | global batch size: 16 | lm loss: 1.132335E+01 | grad norm: 1.098 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 75/ 128728 | consumed samples: 1200 | consumed tokens: 2457600 | elapsed time per iteration (s): 15.16 | learning rate: 3.932E-07 | global batch size: 16 | lm loss: 1.124673E+01 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 76/ 128728 | consumed samples: 1216 | consumed tokens: 2490368 | elapsed time per iteration (s): 15.18 | learning rate: 3.985E-07 | global batch size: 16 | lm loss: 1.127481E+01 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 77/ 128728 | consumed samples: 1232 | consumed tokens: 2523136 | elapsed time per iteration (s): 15.21 | learning rate: 4.037E-07 | global batch size: 16 | lm loss: 1.117865E+01 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 78/ 128728 | consumed samples: 1248 | consumed tokens: 2555904 | elapsed time per iteration (s): 15.18 | learning rate: 4.089E-07 | global batch size: 16 | lm loss: 1.130504E+01 | grad norm: 1.082 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 79/ 128728 | consumed samples: 1264 | consumed tokens: 2588672 | elapsed time per iteration (s): 15.17 | learning rate: 4.142E-07 | global batch size: 16 | lm loss: 1.125540E+01 | grad norm: 1.030 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 80/ 128728 | consumed samples: 1280 | consumed tokens: 2621440 | elapsed time per iteration (s): 15.19 | learning rate: 4.194E-07 | global batch size: 16 | lm loss: 1.120402E+01 | 
grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 81/ 128728 | consumed samples: 1296 | consumed tokens: 2654208 | elapsed time per iteration (s): 15.18 | learning rate: 4.247E-07 | global batch size: 16 | lm loss: 1.119429E+01 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 82/ 128728 | consumed samples: 1312 | consumed tokens: 2686976 | elapsed time per iteration (s): 15.17 | learning rate: 4.299E-07 | global batch size: 16 | lm loss: 1.111624E+01 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 83/ 128728 | consumed samples: 1328 | consumed tokens: 2719744 | elapsed time per iteration (s): 15.16 | learning rate: 4.352E-07 | global batch size: 16 | lm loss: 1.117877E+01 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 84/ 128728 | consumed samples: 1344 | consumed tokens: 2752512 | elapsed time per iteration (s): 15.20 | learning rate: 4.404E-07 | global batch size: 16 | lm loss: 1.093013E+01 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 85/ 128728 | consumed samples: 1360 | consumed tokens: 2785280 | elapsed time per iteration (s): 15.16 | learning rate: 4.456E-07 | global batch size: 16 | lm loss: 1.098155E+01 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 86/ 128728 | consumed samples: 1376 | consumed tokens: 2818048 | elapsed time per iteration (s): 15.18 | 
learning rate: 4.509E-07 | global batch size: 16 | lm loss: 1.111607E+01 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 87/ 128728 | consumed samples: 1392 | consumed tokens: 2850816 | elapsed time per iteration (s): 15.15 | learning rate: 4.561E-07 | global batch size: 16 | lm loss: 1.092821E+01 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 88/ 128728 | consumed samples: 1408 | consumed tokens: 2883584 | elapsed time per iteration (s): 15.17 | learning rate: 4.614E-07 | global batch size: 16 | lm loss: 1.108350E+01 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 89/ 128728 | consumed samples: 1424 | consumed tokens: 2916352 | elapsed time per iteration (s): 15.18 | learning rate: 4.666E-07 | global batch size: 16 | lm loss: 1.089155E+01 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 90/ 128728 | consumed samples: 1440 | consumed tokens: 2949120 | elapsed time per iteration (s): 15.19 | learning rate: 4.719E-07 | global batch size: 16 | lm loss: 1.096077E+01 | grad norm: 0.628 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 91/ 128728 | consumed samples: 1456 | consumed tokens: 2981888 | elapsed time per iteration (s): 15.15 | learning rate: 4.771E-07 | global batch size: 16 | lm loss: 1.101388E+01 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 92/ 128728 | consumed samples: 
1472 | consumed tokens: 3014656 | elapsed time per iteration (s): 15.12 | learning rate: 4.823E-07 | global batch size: 16 | lm loss: 1.093092E+01 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.059 | TFLOPs: 8.10 | [default7]: iteration 93/ 128728 | consumed samples: 1488 | consumed tokens: 3047424 | elapsed time per iteration (s): 15.17 | learning rate: 4.876E-07 | global batch size: 16 | lm loss: 1.113160E+01 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 94/ 128728 | consumed samples: 1504 | consumed tokens: 3080192 | elapsed time per iteration (s): 15.18 | learning rate: 4.928E-07 | global batch size: 16 | lm loss: 1.098779E+01 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 95/ 128728 | consumed samples: 1520 | consumed tokens: 3112960 | elapsed time per iteration (s): 15.19 | learning rate: 4.981E-07 | global batch size: 16 | lm loss: 1.095967E+01 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 96/ 128728 | consumed samples: 1536 | consumed tokens: 3145728 | elapsed time per iteration (s): 15.21 | learning rate: 5.033E-07 | global batch size: 16 | lm loss: 1.094612E+01 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 97/ 128728 | consumed samples: 1552 | consumed tokens: 3178496 | elapsed time per iteration (s): 15.18 | learning rate: 5.086E-07 | global batch size: 16 | lm loss: 1.087047E+01 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 
| TFLOPs: 8.07 | [default7]: iteration 98/ 128728 | consumed samples: 1568 | consumed tokens: 3211264 | elapsed time per iteration (s): 15.19 | learning rate: 5.138E-07 | global batch size: 16 | lm loss: 1.096809E+01 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 99/ 128728 | consumed samples: 1584 | consumed tokens: 3244032 | elapsed time per iteration (s): 15.19 | learning rate: 5.190E-07 | global batch size: 16 | lm loss: 1.106409E+01 | grad norm: 0.905 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 100/ 128728 | consumed samples: 1600 | consumed tokens: 3276800 | elapsed time per iteration (s): 15.18 | learning rate: 5.243E-07 | global batch size: 16 | lm loss: 1.086620E+01 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 101/ 128728 | consumed samples: 1616 | consumed tokens: 3309568 | elapsed time per iteration (s): 15.20 | learning rate: 5.295E-07 | global batch size: 16 | lm loss: 1.089338E+01 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 102/ 128728 | consumed samples: 1632 | consumed tokens: 3342336 | elapsed time per iteration (s): 15.17 | learning rate: 5.348E-07 | global batch size: 16 | lm loss: 1.075126E+01 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 103/ 128728 | consumed samples: 1648 | consumed tokens: 3375104 | elapsed time per iteration (s): 15.17 | learning rate: 5.400E-07 | global batch size: 16 | lm loss: 1.086857E+01 | grad norm: 0.994 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 104/ 128728 | consumed samples: 1664 | consumed tokens: 3407872 | elapsed time per iteration (s): 15.19 | learning rate: 5.453E-07 | global batch size: 16 | lm loss: 1.076913E+01 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 105/ 128728 | consumed samples: 1680 | consumed tokens: 3440640 | elapsed time per iteration (s): 15.18 | learning rate: 5.505E-07 | global batch size: 16 | lm loss: 1.085386E+01 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 106/ 128728 | consumed samples: 1696 | consumed tokens: 3473408 | elapsed time per iteration (s): 15.17 | learning rate: 5.557E-07 | global batch size: 16 | lm loss: 1.088430E+01 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 107/ 128728 | consumed samples: 1712 | consumed tokens: 3506176 | elapsed time per iteration (s): 15.21 | learning rate: 5.610E-07 | global batch size: 16 | lm loss: 1.077884E+01 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 108/ 128728 | consumed samples: 1728 | consumed tokens: 3538944 | elapsed time per iteration (s): 15.20 | learning rate: 5.662E-07 | global batch size: 16 | lm loss: 1.084765E+01 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 109/ 128728 | consumed samples: 1744 | consumed tokens: 3571712 | elapsed time per iteration (s): 15.18 | learning rate: 5.715E-07 | global batch 
size: 16 | lm loss: 1.084685E+01 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 110/ 128728 | consumed samples: 1760 | consumed tokens: 3604480 | elapsed time per iteration (s): 15.19 | learning rate: 5.767E-07 | global batch size: 16 | lm loss: 1.077808E+01 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 111/ 128728 | consumed samples: 1776 | consumed tokens: 3637248 | elapsed time per iteration (s): 15.21 | learning rate: 5.820E-07 | global batch size: 16 | lm loss: 1.084661E+01 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 112/ 128728 | consumed samples: 1792 | consumed tokens: 3670016 | elapsed time per iteration (s): 15.17 | learning rate: 5.872E-07 | global batch size: 16 | lm loss: 1.073598E+01 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 113/ 128728 | consumed samples: 1808 | consumed tokens: 3702784 | elapsed time per iteration (s): 15.19 | learning rate: 5.924E-07 | global batch size: 16 | lm loss: 1.073445E+01 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 114/ 128728 | consumed samples: 1824 | consumed tokens: 3735552 | elapsed time per iteration (s): 15.20 | learning rate: 5.977E-07 | global batch size: 16 | lm loss: 1.084661E+01 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 115/ 128728 | consumed samples: 1840 | consumed tokens: 3768320 | 
elapsed time per iteration (s): 15.22 | learning rate: 6.029E-07 | global batch size: 16 | lm loss: 1.072918E+01 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 116/ 128728 | consumed samples: 1856 | consumed tokens: 3801088 | elapsed time per iteration (s): 15.18 | learning rate: 6.082E-07 | global batch size: 16 | lm loss: 1.071857E+01 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 117/ 128728 | consumed samples: 1872 | consumed tokens: 3833856 | elapsed time per iteration (s): 15.20 | learning rate: 6.134E-07 | global batch size: 16 | lm loss: 1.081528E+01 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 118/ 128728 | consumed samples: 1888 | consumed tokens: 3866624 | elapsed time per iteration (s): 15.21 | learning rate: 6.187E-07 | global batch size: 16 | lm loss: 1.083505E+01 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 119/ 128728 | consumed samples: 1904 | consumed tokens: 3899392 | elapsed time per iteration (s): 15.12 | learning rate: 6.239E-07 | global batch size: 16 | lm loss: 1.081624E+01 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 120/ 128728 | consumed samples: 1920 | consumed tokens: 3932160 | elapsed time per iteration (s): 15.15 | learning rate: 6.291E-07 | global batch size: 16 | lm loss: 1.068017E+01 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: 
iteration 121/ 128728 | consumed samples: 1936 | consumed tokens: 3964928 | elapsed time per iteration (s): 15.19 | learning rate: 6.344E-07 | global batch size: 16 | lm loss: 1.087509E+01 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 122/ 128728 | consumed samples: 1952 | consumed tokens: 3997696 | elapsed time per iteration (s): 15.20 | learning rate: 6.396E-07 | global batch size: 16 | lm loss: 1.068378E+01 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 123/ 128728 | consumed samples: 1968 | consumed tokens: 4030464 | elapsed time per iteration (s): 15.20 | learning rate: 6.449E-07 | global batch size: 16 | lm loss: 1.059418E+01 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 124/ 128728 | consumed samples: 1984 | consumed tokens: 4063232 | elapsed time per iteration (s): 15.20 | learning rate: 6.501E-07 | global batch size: 16 | lm loss: 1.072522E+01 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 125/ 128728 | consumed samples: 2000 | consumed tokens: 4096000 | elapsed time per iteration (s): 15.16 | learning rate: 6.554E-07 | global batch size: 16 | lm loss: 1.064985E+01 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 126/ 128728 | consumed samples: 2016 | consumed tokens: 4128768 | elapsed time per iteration (s): 15.21 | learning rate: 6.606E-07 | global batch size: 16 | lm loss: 1.092184E+01 | grad norm: 1.960 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 127/ 128728 | consumed samples: 2032 | consumed tokens: 4161536 | elapsed time per iteration (s): 15.18 | learning rate: 6.658E-07 | global batch size: 16 | lm loss: 1.067953E+01 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 128/ 128728 | consumed samples: 2048 | consumed tokens: 4194304 | elapsed time per iteration (s): 15.19 | learning rate: 6.711E-07 | global batch size: 16 | lm loss: 1.074638E+01 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 129/ 128728 | consumed samples: 2064 | consumed tokens: 4227072 | elapsed time per iteration (s): 15.18 | learning rate: 6.763E-07 | global batch size: 16 | lm loss: 1.075598E+01 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 130/ 128728 | consumed samples: 2080 | consumed tokens: 4259840 | elapsed time per iteration (s): 15.17 | learning rate: 6.816E-07 | global batch size: 16 | lm loss: 1.073375E+01 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 131/ 128728 | consumed samples: 2096 | consumed tokens: 4292608 | elapsed time per iteration (s): 15.21 | learning rate: 6.868E-07 | global batch size: 16 | lm loss: 1.056206E+01 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 132/ 128728 | consumed samples: 2112 | consumed tokens: 4325376 | elapsed time per iteration (s): 15.17 | learning rate: 6.921E-07 | global batch size: 16 | lm loss: 1.071002E+01 | 
grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 133/ 128728 | consumed samples: 2128 | consumed tokens: 4358144 | elapsed time per iteration (s): 15.22 | learning rate: 6.973E-07 | global batch size: 16 | lm loss: 1.081503E+01 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 134/ 128728 | consumed samples: 2144 | consumed tokens: 4390912 | elapsed time per iteration (s): 15.16 | learning rate: 7.025E-07 | global batch size: 16 | lm loss: 1.042019E+01 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 135/ 128728 | consumed samples: 2160 | consumed tokens: 4423680 | elapsed time per iteration (s): 15.20 | learning rate: 7.078E-07 | global batch size: 16 | lm loss: 1.065207E+01 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 136/ 128728 | consumed samples: 2176 | consumed tokens: 4456448 | elapsed time per iteration (s): 15.17 | learning rate: 7.130E-07 | global batch size: 16 | lm loss: 1.066140E+01 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 137/ 128728 | consumed samples: 2192 | consumed tokens: 4489216 | elapsed time per iteration (s): 15.21 | learning rate: 7.183E-07 | global batch size: 16 | lm loss: 1.060394E+01 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 138/ 128728 | consumed samples: 2208 | consumed tokens: 4521984 | elapsed time per iteration (s): 15.14 
| learning rate: 7.235E-07 | global batch size: 16 | lm loss: 1.051196E+01 | grad norm: 1.103 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 139/ 128728 | consumed samples: 2224 | consumed tokens: 4554752 | elapsed time per iteration (s): 15.15 | learning rate: 7.288E-07 | global batch size: 16 | lm loss: 1.058902E+01 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 140/ 128728 | consumed samples: 2240 | consumed tokens: 4587520 | elapsed time per iteration (s): 15.19 | learning rate: 7.340E-07 | global batch size: 16 | lm loss: 1.060271E+01 | grad norm: 1.084 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 141/ 128728 | consumed samples: 2256 | consumed tokens: 4620288 | elapsed time per iteration (s): 15.20 | learning rate: 7.392E-07 | global batch size: 16 | lm loss: 1.046633E+01 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 142/ 128728 | consumed samples: 2272 | consumed tokens: 4653056 | elapsed time per iteration (s): 15.20 | learning rate: 7.445E-07 | global batch size: 16 | lm loss: 1.055144E+01 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 143/ 128728 | consumed samples: 2288 | consumed tokens: 4685824 | elapsed time per iteration (s): 15.16 | learning rate: 7.497E-07 | global batch size: 16 | lm loss: 1.071862E+01 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 144/ 128728 | consumed 
samples: 2304 | consumed tokens: 4718592 | elapsed time per iteration (s): 15.21 | learning rate: 7.550E-07 | global batch size: 16 | lm loss: 1.053111E+01 | grad norm: 0.905 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 145/ 128728 | consumed samples: 2320 | consumed tokens: 4751360 | elapsed time per iteration (s): 15.21 | learning rate: 7.602E-07 | global batch size: 16 | lm loss: 1.067661E+01 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 146/ 128728 | consumed samples: 2336 | consumed tokens: 4784128 | elapsed time per iteration (s): 15.21 | learning rate: 7.655E-07 | global batch size: 16 | lm loss: 1.046909E+01 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 147/ 128728 | consumed samples: 2352 | consumed tokens: 4816896 | elapsed time per iteration (s): 15.20 | learning rate: 7.707E-07 | global batch size: 16 | lm loss: 1.068971E+01 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 148/ 128728 | consumed samples: 2368 | consumed tokens: 4849664 | elapsed time per iteration (s): 15.17 | learning rate: 7.759E-07 | global batch size: 16 | lm loss: 1.061832E+01 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 149/ 128728 | consumed samples: 2384 | consumed tokens: 4882432 | elapsed time per iteration (s): 15.17 | learning rate: 7.812E-07 | global batch size: 16 | lm loss: 1.059798E+01 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 150/ 128728 | consumed samples: 2400 | consumed tokens: 4915200 | elapsed time per iteration (s): 15.17 | learning rate: 7.864E-07 | global batch size: 16 | lm loss: 1.044703E+01 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 151/ 128728 | consumed samples: 2416 | consumed tokens: 4947968 | elapsed time per iteration (s): 15.17 | learning rate: 7.917E-07 | global batch size: 16 | lm loss: 1.052176E+01 | grad norm: 0.879 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 152/ 128728 | consumed samples: 2432 | consumed tokens: 4980736 | elapsed time per iteration (s): 15.17 | learning rate: 7.969E-07 | global batch size: 16 | lm loss: 1.056132E+01 | grad norm: 0.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 153/ 128728 | consumed samples: 2448 | consumed tokens: 5013504 | elapsed time per iteration (s): 15.20 | learning rate: 8.022E-07 | global batch size: 16 | lm loss: 1.038216E+01 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 154/ 128728 | consumed samples: 2464 | consumed tokens: 5046272 | elapsed time per iteration (s): 15.16 | learning rate: 8.074E-07 | global batch size: 16 | lm loss: 1.059594E+01 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 155/ 128728 | consumed samples: 2480 | consumed tokens: 5079040 | elapsed time per iteration (s): 15.20 | learning rate: 8.126E-07 | global batch size: 16 | lm loss: 1.039668E+01 | grad norm: 0.626 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 156/ 128728 | consumed samples: 2496 | consumed tokens: 5111808 | elapsed time per iteration (s): 15.19 | learning rate: 8.179E-07 | global batch size: 16 | lm loss: 1.041435E+01 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 157/ 128728 | consumed samples: 2512 | consumed tokens: 5144576 | elapsed time per iteration (s): 15.22 | learning rate: 8.231E-07 | global batch size: 16 | lm loss: 1.060662E+01 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 158/ 128728 | consumed samples: 2528 | consumed tokens: 5177344 | elapsed time per iteration (s): 15.21 | learning rate: 8.284E-07 | global batch size: 16 | lm loss: 1.036032E+01 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 159/ 128728 | consumed samples: 2544 | consumed tokens: 5210112 | elapsed time per iteration (s): 15.20 | learning rate: 8.336E-07 | global batch size: 16 | lm loss: 1.040484E+01 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 160/ 128728 | consumed samples: 2560 | consumed tokens: 5242880 | elapsed time per iteration (s): 15.21 | learning rate: 8.389E-07 | global batch size: 16 | lm loss: 1.053427E+01 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 161/ 128728 | consumed samples: 2576 | consumed tokens: 5275648 | elapsed time per iteration (s): 15.21 | learning rate: 8.441E-07 | global 
batch size: 16 | lm loss: 1.045372E+01 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 162/ 128728 | consumed samples: 2592 | consumed tokens: 5308416 | elapsed time per iteration (s): 15.24 | learning rate: 8.493E-07 | global batch size: 16 | lm loss: 1.044134E+01 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 163/ 128728 | consumed samples: 2608 | consumed tokens: 5341184 | elapsed time per iteration (s): 15.21 | learning rate: 8.546E-07 | global batch size: 16 | lm loss: 1.037730E+01 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 164/ 128728 | consumed samples: 2624 | consumed tokens: 5373952 | elapsed time per iteration (s): 15.19 | learning rate: 8.598E-07 | global batch size: 16 | lm loss: 1.037023E+01 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 165/ 128728 | consumed samples: 2640 | consumed tokens: 5406720 | elapsed time per iteration (s): 15.22 | learning rate: 8.651E-07 | global batch size: 16 | lm loss: 1.033101E+01 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 166/ 128728 | consumed samples: 2656 | consumed tokens: 5439488 | elapsed time per iteration (s): 15.21 | learning rate: 8.703E-07 | global batch size: 16 | lm loss: 1.036347E+01 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 167/ 128728 | consumed samples: 2672 | consumed tokens: 5472256 
| elapsed time per iteration (s): 15.21 | learning rate: 8.756E-07 | global batch size: 16 | lm loss: 1.042902E+01 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 168/ 128728 | consumed samples: 2688 | consumed tokens: 5505024 | elapsed time per iteration (s): 15.23 | learning rate: 8.808E-07 | global batch size: 16 | lm loss: 1.027396E+01 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 169/ 128728 | consumed samples: 2704 | consumed tokens: 5537792 | elapsed time per iteration (s): 15.14 | learning rate: 8.860E-07 | global batch size: 16 | lm loss: 1.033432E+01 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 170/ 128728 | consumed samples: 2720 | consumed tokens: 5570560 | elapsed time per iteration (s): 15.22 | learning rate: 8.913E-07 | global batch size: 16 | lm loss: 1.026951E+01 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 171/ 128728 | consumed samples: 2736 | consumed tokens: 5603328 | elapsed time per iteration (s): 15.18 | learning rate: 8.965E-07 | global batch size: 16 | lm loss: 1.017828E+01 | grad norm: 2.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 172/ 128728 | consumed samples: 2752 | consumed tokens: 5636096 | elapsed time per iteration (s): 15.18 | learning rate: 9.018E-07 | global batch size: 16 | lm loss: 1.032809E+01 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | 
[default7]: iteration 173/ 128728 | consumed samples: 2768 | consumed tokens: 5668864 | elapsed time per iteration (s): 15.18 | learning rate: 9.070E-07 | global batch size: 16 | lm loss: 1.033579E+01 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 174/ 128728 | consumed samples: 2784 | consumed tokens: 5701632 | elapsed time per iteration (s): 15.20 | learning rate: 9.123E-07 | global batch size: 16 | lm loss: 1.006872E+01 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 175/ 128728 | consumed samples: 2800 | consumed tokens: 5734400 | elapsed time per iteration (s): 15.22 | learning rate: 9.175E-07 | global batch size: 16 | lm loss: 1.022235E+01 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 176/ 128728 | consumed samples: 2816 | consumed tokens: 5767168 | elapsed time per iteration (s): 15.21 | learning rate: 9.227E-07 | global batch size: 16 | lm loss: 1.020374E+01 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 177/ 128728 | consumed samples: 2832 | consumed tokens: 5799936 | elapsed time per iteration (s): 15.17 | learning rate: 9.280E-07 | global batch size: 16 | lm loss: 1.014564E+01 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 178/ 128728 | consumed samples: 2848 | consumed tokens: 5832704 | elapsed time per iteration (s): 15.21 | learning rate: 9.332E-07 | global batch size: 16 | lm loss: 1.032068E+01 | grad norm: 1.243 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 179/ 128728 | consumed samples: 2864 | consumed tokens: 5865472 | elapsed time per iteration (s): 15.19 | learning rate: 9.385E-07 | global batch size: 16 | lm loss: 1.024278E+01 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 180/ 128728 | consumed samples: 2880 | consumed tokens: 5898240 | elapsed time per iteration (s): 15.19 | learning rate: 9.437E-07 | global batch size: 16 | lm loss: 1.029474E+01 | grad norm: 0.598 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 181/ 128728 | consumed samples: 2896 | consumed tokens: 5931008 | elapsed time per iteration (s): 15.19 | learning rate: 9.490E-07 | global batch size: 16 | lm loss: 1.046901E+01 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 182/ 128728 | consumed samples: 2912 | consumed tokens: 5963776 | elapsed time per iteration (s): 15.22 | learning rate: 9.542E-07 | global batch size: 16 | lm loss: 1.012921E+01 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 183/ 128728 | consumed samples: 2928 | consumed tokens: 5996544 | elapsed time per iteration (s): 15.20 | learning rate: 9.594E-07 | global batch size: 16 | lm loss: 1.034022E+01 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 184/ 128728 | consumed samples: 2944 | consumed tokens: 6029312 | elapsed time per iteration (s): 15.18 | learning rate: 9.647E-07 | global batch size: 16 | lm loss: 
1.003381E+01 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 185/ 128728 | consumed samples: 2960 | consumed tokens: 6062080 | elapsed time per iteration (s): 15.16 | learning rate: 9.699E-07 | global batch size: 16 | lm loss: 1.021115E+01 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 186/ 128728 | consumed samples: 2976 | consumed tokens: 6094848 | elapsed time per iteration (s): 15.19 | learning rate: 9.752E-07 | global batch size: 16 | lm loss: 1.006208E+01 | grad norm: 1.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 187/ 128728 | consumed samples: 2992 | consumed tokens: 6127616 | elapsed time per iteration (s): 15.22 | learning rate: 9.804E-07 | global batch size: 16 | lm loss: 1.016780E+01 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 188/ 128728 | consumed samples: 3008 | consumed tokens: 6160384 | elapsed time per iteration (s): 15.17 | learning rate: 9.857E-07 | global batch size: 16 | lm loss: 1.032679E+01 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 189/ 128728 | consumed samples: 3024 | consumed tokens: 6193152 | elapsed time per iteration (s): 15.20 | learning rate: 9.909E-07 | global batch size: 16 | lm loss: 1.000206E+01 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 190/ 128728 | consumed samples: 3040 | consumed tokens: 6225920 | elapsed time per 
iteration (s): 15.21 | learning rate: 9.961E-07 | global batch size: 16 | lm loss: 1.015638E+01 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 191/ 128728 | consumed samples: 3056 | consumed tokens: 6258688 | elapsed time per iteration (s): 15.21 | learning rate: 1.001E-06 | global batch size: 16 | lm loss: 9.991480E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 192/ 128728 | consumed samples: 3072 | consumed tokens: 6291456 | elapsed time per iteration (s): 15.20 | learning rate: 1.007E-06 | global batch size: 16 | lm loss: 1.009315E+01 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 193/ 128728 | consumed samples: 3088 | consumed tokens: 6324224 | elapsed time per iteration (s): 15.26 | learning rate: 1.012E-06 | global batch size: 16 | lm loss: 9.941729E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 194/ 128728 | consumed samples: 3104 | consumed tokens: 6356992 | elapsed time per iteration (s): 15.23 | learning rate: 1.017E-06 | global batch size: 16 | lm loss: 1.005856E+01 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 195/ 128728 | consumed samples: 3120 | consumed tokens: 6389760 | elapsed time per iteration (s): 15.20 | learning rate: 1.022E-06 | global batch size: 16 | lm loss: 1.016409E+01 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 196/ 
128728 | consumed samples: 3136 | consumed tokens: 6422528 | elapsed time per iteration (s): 15.23 | learning rate: 1.028E-06 | global batch size: 16 | lm loss: 9.989647E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 197/ 128728 | consumed samples: 3152 | consumed tokens: 6455296 | elapsed time per iteration (s): 15.23 | learning rate: 1.033E-06 | global batch size: 16 | lm loss: 9.962715E+00 | grad norm: 0.655 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 198/ 128728 | consumed samples: 3168 | consumed tokens: 6488064 | elapsed time per iteration (s): 15.24 | learning rate: 1.038E-06 | global batch size: 16 | lm loss: 1.009250E+01 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 199/ 128728 | consumed samples: 3184 | consumed tokens: 6520832 | elapsed time per iteration (s): 15.22 | learning rate: 1.043E-06 | global batch size: 16 | lm loss: 9.905367E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 200/ 128728 | consumed samples: 3200 | consumed tokens: 6553600 | elapsed time per iteration (s): 15.20 | learning rate: 1.049E-06 | global batch size: 16 | lm loss: 1.007274E+01 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 201/ 128728 | consumed samples: 3216 | consumed tokens: 6586368 | elapsed time per iteration (s): 15.22 | learning rate: 1.054E-06 | global batch size: 16 | lm loss: 9.892535E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 202/ 128728 | consumed samples: 3232 | consumed tokens: 6619136 | elapsed time per iteration (s): 15.22 | learning rate: 1.059E-06 | global batch size: 16 | lm loss: 9.908247E+00 | grad norm: 1.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 203/ 128728 | consumed samples: 3248 | consumed tokens: 6651904 | elapsed time per iteration (s): 15.20 | learning rate: 1.064E-06 | global batch size: 16 | lm loss: 9.759439E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 204/ 128728 | consumed samples: 3264 | consumed tokens: 6684672 | elapsed time per iteration (s): 15.23 | learning rate: 1.070E-06 | global batch size: 16 | lm loss: 9.843822E+00 | grad norm: 1.350 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 205/ 128728 | consumed samples: 3280 | consumed tokens: 6717440 | elapsed time per iteration (s): 15.20 | learning rate: 1.075E-06 | global batch size: 16 | lm loss: 9.970119E+00 | grad norm: 1.374 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 206/ 128728 | consumed samples: 3296 | consumed tokens: 6750208 | elapsed time per iteration (s): 15.23 | learning rate: 1.080E-06 | global batch size: 16 | lm loss: 1.008592E+01 | grad norm: 1.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 207/ 128728 | consumed samples: 3312 | consumed tokens: 6782976 | elapsed time per iteration (s): 15.21 | learning rate: 1.085E-06 | global batch size: 16 | lm loss: 9.928805E+00 | grad 
norm: 1.043 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 208/ 128728 | consumed samples: 3328 | consumed tokens: 6815744 | elapsed time per iteration (s): 15.20 | learning rate: 1.091E-06 | global batch size: 16 | lm loss: 9.940935E+00 | grad norm: 1.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 209/ 128728 | consumed samples: 3344 | consumed tokens: 6848512 | elapsed time per iteration (s): 15.23 | learning rate: 1.096E-06 | global batch size: 16 | lm loss: 9.809174E+00 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 210/ 128728 | consumed samples: 3360 | consumed tokens: 6881280 | elapsed time per iteration (s): 15.24 | learning rate: 1.101E-06 | global batch size: 16 | lm loss: 9.955800E+00 | grad norm: 1.625 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 211/ 128728 | consumed samples: 3376 | consumed tokens: 6914048 | elapsed time per iteration (s): 15.22 | learning rate: 1.106E-06 | global batch size: 16 | lm loss: 9.909077E+00 | grad norm: 1.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 212/ 128728 | consumed samples: 3392 | consumed tokens: 6946816 | elapsed time per iteration (s): 15.23 | learning rate: 1.111E-06 | global batch size: 16 | lm loss: 9.912115E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 213/ 128728 | consumed samples: 3408 | consumed tokens: 6979584 | elapsed time per iteration (s): 15.21 | 
learning rate: 1.117E-06 | global batch size: 16 | lm loss: 9.802191E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 214/ 128728 | consumed samples: 3424 | consumed tokens: 7012352 | elapsed time per iteration (s): 15.22 | learning rate: 1.122E-06 | global batch size: 16 | lm loss: 9.900744E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 215/ 128728 | consumed samples: 3440 | consumed tokens: 7045120 | elapsed time per iteration (s): 15.21 | learning rate: 1.127E-06 | global batch size: 16 | lm loss: 9.727583E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 216/ 128728 | consumed samples: 3456 | consumed tokens: 7077888 | elapsed time per iteration (s): 15.22 | learning rate: 1.132E-06 | global batch size: 16 | lm loss: 9.846464E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 217/ 128728 | consumed samples: 3472 | consumed tokens: 7110656 | elapsed time per iteration (s): 15.19 | learning rate: 1.138E-06 | global batch size: 16 | lm loss: 1.001000E+01 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 218/ 128728 | consumed samples: 3488 | consumed tokens: 7143424 | elapsed time per iteration (s): 15.23 | learning rate: 1.143E-06 | global batch size: 16 | lm loss: 9.839026E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 219/ 128728 | consumed 
samples: 3504 | consumed tokens: 7176192 | elapsed time per iteration (s): 15.22 | learning rate: 1.148E-06 | global batch size: 16 | lm loss: 9.730466E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 220/ 128728 | consumed samples: 3520 | consumed tokens: 7208960 | elapsed time per iteration (s): 15.17 | learning rate: 1.153E-06 | global batch size: 16 | lm loss: 9.767716E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 221/ 128728 | consumed samples: 3536 | consumed tokens: 7241728 | elapsed time per iteration (s): 15.18 | learning rate: 1.159E-06 | global batch size: 16 | lm loss: 9.788709E+00 | grad norm: 1.378 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 222/ 128728 | consumed samples: 3552 | consumed tokens: 7274496 | elapsed time per iteration (s): 15.19 | learning rate: 1.164E-06 | global batch size: 16 | lm loss: 9.750614E+00 | grad norm: 1.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 223/ 128728 | consumed samples: 3568 | consumed tokens: 7307264 | elapsed time per iteration (s): 15.20 | learning rate: 1.169E-06 | global batch size: 16 | lm loss: 9.629946E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 224/ 128728 | consumed samples: 3584 | consumed tokens: 7340032 | elapsed time per iteration (s): 15.20 | learning rate: 1.174E-06 | global batch size: 16 | lm loss: 1.004527E+01 | grad norm: 1.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 225/ 128728 | consumed samples: 3600 | consumed tokens: 7372800 | elapsed time per iteration (s): 15.24 | learning rate: 1.180E-06 | global batch size: 16 | lm loss: 9.818333E+00 | grad norm: 1.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 226/ 128728 | consumed samples: 3616 | consumed tokens: 7405568 | elapsed time per iteration (s): 15.18 | learning rate: 1.185E-06 | global batch size: 16 | lm loss: 9.859022E+00 | grad norm: 1.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 227/ 128728 | consumed samples: 3632 | consumed tokens: 7438336 | elapsed time per iteration (s): 15.17 | learning rate: 1.190E-06 | global batch size: 16 | lm loss: 9.774880E+00 | grad norm: 1.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 228/ 128728 | consumed samples: 3648 | consumed tokens: 7471104 | elapsed time per iteration (s): 15.22 | learning rate: 1.195E-06 | global batch size: 16 | lm loss: 9.777248E+00 | grad norm: 1.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 229/ 128728 | consumed samples: 3664 | consumed tokens: 7503872 | elapsed time per iteration (s): 15.23 | learning rate: 1.201E-06 | global batch size: 16 | lm loss: 9.784309E+00 | grad norm: 1.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 230/ 128728 | consumed samples: 3680 | consumed tokens: 7536640 | elapsed time per iteration (s): 15.21 | learning rate: 1.206E-06 | global batch size: 16 | lm loss: 9.753279E+00 | grad norm: 1.041 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 231/ 128728 | consumed samples: 3696 | consumed tokens: 7569408 | elapsed time per iteration (s): 15.22 | learning rate: 1.211E-06 | global batch size: 16 | lm loss: 9.784714E+00 | grad norm: 1.064 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 232/ 128728 | consumed samples: 3712 | consumed tokens: 7602176 | elapsed time per iteration (s): 15.22 | learning rate: 1.216E-06 | global batch size: 16 | lm loss: 9.695133E+00 | grad norm: 1.334 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 233/ 128728 | consumed samples: 3728 | consumed tokens: 7634944 | elapsed time per iteration (s): 15.22 | learning rate: 1.222E-06 | global batch size: 16 | lm loss: 9.556194E+00 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 234/ 128728 | consumed samples: 3744 | consumed tokens: 7667712 | elapsed time per iteration (s): 15.19 | learning rate: 1.227E-06 | global batch size: 16 | lm loss: 9.775770E+00 | grad norm: 1.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 235/ 128728 | consumed samples: 3760 | consumed tokens: 7700480 | elapsed time per iteration (s): 15.22 | learning rate: 1.232E-06 | global batch size: 16 | lm loss: 9.595947E+00 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 236/ 128728 | consumed samples: 3776 | consumed tokens: 7733248 | elapsed time per iteration (s): 15.21 | learning rate: 1.237E-06 | global 
batch size: 16 | lm loss: 9.768786E+00 | grad norm: 1.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 237/ 128728 | consumed samples: 3792 | consumed tokens: 7766016 | elapsed time per iteration (s): 15.24 | learning rate: 1.243E-06 | global batch size: 16 | lm loss: 9.527258E+00 | grad norm: 1.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 238/ 128728 | consumed samples: 3808 | consumed tokens: 7798784 | elapsed time per iteration (s): 15.22 | learning rate: 1.248E-06 | global batch size: 16 | lm loss: 9.808368E+00 | grad norm: 1.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 239/ 128728 | consumed samples: 3824 | consumed tokens: 7831552 | elapsed time per iteration (s): 15.20 | learning rate: 1.253E-06 | global batch size: 16 | lm loss: 9.664412E+00 | grad norm: 1.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 240/ 128728 | consumed samples: 3840 | consumed tokens: 7864320 | elapsed time per iteration (s): 15.21 | learning rate: 1.258E-06 | global batch size: 16 | lm loss: 9.680309E+00 | grad norm: 2.453 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 241/ 128728 | consumed samples: 3856 | consumed tokens: 7897088 | elapsed time per iteration (s): 15.22 | learning rate: 1.264E-06 | global batch size: 16 | lm loss: 9.523140E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 242/ 128728 | consumed samples: 3872 | consumed tokens: 7929856 
| elapsed time per iteration (s): 15.18 | learning rate: 1.269E-06 | global batch size: 16 | lm loss: 9.746195E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 243/ 128728 | consumed samples: 3888 | consumed tokens: 7962624 | elapsed time per iteration (s): 15.20 | learning rate: 1.274E-06 | global batch size: 16 | lm loss: 9.654213E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 244/ 128728 | consumed samples: 3904 | consumed tokens: 7995392 | elapsed time per iteration (s): 15.19 | learning rate: 1.279E-06 | global batch size: 16 | lm loss: 9.681046E+00 | grad norm: 1.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 245/ 128728 | consumed samples: 3920 | consumed tokens: 8028160 | elapsed time per iteration (s): 15.22 | learning rate: 1.285E-06 | global batch size: 16 | lm loss: 9.748778E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 246/ 128728 | consumed samples: 3936 | consumed tokens: 8060928 | elapsed time per iteration (s): 15.21 | learning rate: 1.290E-06 | global batch size: 16 | lm loss: 9.600563E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 247/ 128728 | consumed samples: 3952 | consumed tokens: 8093696 | elapsed time per iteration (s): 15.18 | learning rate: 1.295E-06 | global batch size: 16 | lm loss: 9.489889E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | 
[default7]: iteration 248/ 128728 | consumed samples: 3968 | consumed tokens: 8126464 | elapsed time per iteration (s): 15.22 | learning rate: 1.300E-06 | global batch size: 16 | lm loss: 9.397079E+00 | grad norm: 1.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 249/ 128728 | consumed samples: 3984 | consumed tokens: 8159232 | elapsed time per iteration (s): 15.19 | learning rate: 1.305E-06 | global batch size: 16 | lm loss: 9.610090E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 250/ 128728 | consumed samples: 4000 | consumed tokens: 8192000 | elapsed time per iteration (s): 15.22 | learning rate: 1.311E-06 | global batch size: 16 | lm loss: 9.520005E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 251/ 128728 | consumed samples: 4016 | consumed tokens: 8224768 | elapsed time per iteration (s): 15.23 | learning rate: 1.316E-06 | global batch size: 16 | lm loss: 9.354611E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 252/ 128728 | consumed samples: 4032 | consumed tokens: 8257536 | elapsed time per iteration (s): 15.26 | learning rate: 1.321E-06 | global batch size: 16 | lm loss: 9.402354E+00 | grad norm: 1.381 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 253/ 128728 | consumed samples: 4048 | consumed tokens: 8290304 | elapsed time per iteration (s): 15.25 | learning rate: 1.326E-06 | global batch size: 16 | lm loss: 9.472418E+00 | grad norm: 1.558 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 254/ 128728 | consumed samples: 4064 | consumed tokens: 8323072 | elapsed time per iteration (s): 15.24 | learning rate: 1.332E-06 | global batch size: 16 | lm loss: 9.596987E+00 | grad norm: 1.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 255/ 128728 | consumed samples: 4080 | consumed tokens: 8355840 | elapsed time per iteration (s): 15.25 | learning rate: 1.337E-06 | global batch size: 16 | lm loss: 9.526454E+00 | grad norm: 1.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 256/ 128728 | consumed samples: 4096 | consumed tokens: 8388608 | elapsed time per iteration (s): 15.29 | learning rate: 1.342E-06 | global batch size: 16 | lm loss: 9.607473E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 257/ 128728 | consumed samples: 4112 | consumed tokens: 8421376 | elapsed time per iteration (s): 15.25 | learning rate: 1.347E-06 | global batch size: 16 | lm loss: 9.439919E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 258/ 128728 | consumed samples: 4128 | consumed tokens: 8454144 | elapsed time per iteration (s): 15.23 | learning rate: 1.353E-06 | global batch size: 16 | lm loss: 9.616064E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 259/ 128728 | consumed samples: 4144 | consumed tokens: 8486912 | elapsed time per iteration (s): 15.27 | learning rate: 1.358E-06 | global batch size: 16 | lm loss: 
9.386358E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 260/ 128728 | consumed samples: 4160 | consumed tokens: 8519680 | elapsed time per iteration (s): 15.23 | learning rate: 1.363E-06 | global batch size: 16 | lm loss: 9.311523E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 261/ 128728 | consumed samples: 4176 | consumed tokens: 8552448 | elapsed time per iteration (s): 15.17 | learning rate: 1.368E-06 | global batch size: 16 | lm loss: 9.406882E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 262/ 128728 | consumed samples: 4192 | consumed tokens: 8585216 | elapsed time per iteration (s): 15.20 | learning rate: 1.374E-06 | global batch size: 16 | lm loss: 9.483783E+00 | grad norm: 1.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 263/ 128728 | consumed samples: 4208 | consumed tokens: 8617984 | elapsed time per iteration (s): 15.21 | learning rate: 1.379E-06 | global batch size: 16 | lm loss: 9.435326E+00 | grad norm: 1.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 264/ 128728 | consumed samples: 4224 | consumed tokens: 8650752 | elapsed time per iteration (s): 15.23 | learning rate: 1.384E-06 | global batch size: 16 | lm loss: 9.483128E+00 | grad norm: 1.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 265/ 128728 | consumed samples: 4240 | consumed tokens: 8683520 | elapsed time per 
iteration (s): 15.24 | learning rate: 1.389E-06 | global batch size: 16 | lm loss: 9.487989E+00 | grad norm: 1.064 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 266/ 128728 | consumed samples: 4256 | consumed tokens: 8716288 | elapsed time per iteration (s): 15.24 | learning rate: 1.395E-06 | global batch size: 16 | lm loss: 9.551134E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 267/ 128728 | consumed samples: 4272 | consumed tokens: 8749056 | elapsed time per iteration (s): 15.24 | learning rate: 1.400E-06 | global batch size: 16 | lm loss: 9.242275E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 268/ 128728 | consumed samples: 4288 | consumed tokens: 8781824 | elapsed time per iteration (s): 15.26 | learning rate: 1.405E-06 | global batch size: 16 | lm loss: 9.469782E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 269/ 128728 | consumed samples: 4304 | consumed tokens: 8814592 | elapsed time per iteration (s): 15.24 | learning rate: 1.410E-06 | global batch size: 16 | lm loss: 9.499035E+00 | grad norm: 1.487 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 270/ 128728 | consumed samples: 4320 | consumed tokens: 8847360 | elapsed time per iteration (s): 15.22 | learning rate: 1.416E-06 | global batch size: 16 | lm loss: 9.467442E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 271/ 
128728 | consumed samples: 4336 | consumed tokens: 8880128 | elapsed time per iteration (s): 15.24 | learning rate: 1.421E-06 | global batch size: 16 | lm loss: 9.442656E+00 | grad norm: 1.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 272/ 128728 | consumed samples: 4352 | consumed tokens: 8912896 | elapsed time per iteration (s): 15.21 | learning rate: 1.426E-06 | global batch size: 16 | lm loss: 9.322374E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 273/ 128728 | consumed samples: 4368 | consumed tokens: 8945664 | elapsed time per iteration (s): 15.21 | learning rate: 1.431E-06 | global batch size: 16 | lm loss: 9.270580E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 274/ 128728 | consumed samples: 4384 | consumed tokens: 8978432 | elapsed time per iteration (s): 15.22 | learning rate: 1.437E-06 | global batch size: 16 | lm loss: 9.374606E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 275/ 128728 | consumed samples: 4400 | consumed tokens: 9011200 | elapsed time per iteration (s): 15.24 | learning rate: 1.442E-06 | global batch size: 16 | lm loss: 9.264148E+00 | grad norm: 1.074 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 276/ 128728 | consumed samples: 4416 | consumed tokens: 9043968 | elapsed time per iteration (s): 15.26 | learning rate: 1.447E-06 | global batch size: 16 | lm loss: 9.256626E+00 | grad norm: 1.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 277/ 128728 | consumed samples: 4432 | consumed tokens: 9076736 | elapsed time per iteration (s): 15.26 | learning rate: 1.452E-06 | global batch size: 16 | lm loss: 9.479916E+00 | grad norm: 2.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 278/ 128728 | consumed samples: 4448 | consumed tokens: 9109504 | elapsed time per iteration (s): 15.25 | learning rate: 1.458E-06 | global batch size: 16 | lm loss: 9.171821E+00 | grad norm: 1.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 279/ 128728 | consumed samples: 4464 | consumed tokens: 9142272 | elapsed time per iteration (s): 15.25 | learning rate: 1.463E-06 | global batch size: 16 | lm loss: 9.419685E+00 | grad norm: 1.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 280/ 128728 | consumed samples: 4480 | consumed tokens: 9175040 | elapsed time per iteration (s): 15.21 | learning rate: 1.468E-06 | global batch size: 16 | lm loss: 9.336754E+00 | grad norm: 1.083 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 281/ 128728 | consumed samples: 4496 | consumed tokens: 9207808 | elapsed time per iteration (s): 15.24 | learning rate: 1.473E-06 | global batch size: 16 | lm loss: 9.144946E+00 | grad norm: 1.990 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 282/ 128728 | consumed samples: 4512 | consumed tokens: 9240576 | elapsed time per iteration (s): 15.24 | learning rate: 1.478E-06 | global batch size: 16 | lm loss: 9.401902E+00 | grad 
norm: 1.032 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 283/ 128728 | consumed samples: 4528 | consumed tokens: 9273344 | elapsed time per iteration (s): 15.21 | learning rate: 1.484E-06 | global batch size: 16 | lm loss: 9.207463E+00 | grad norm: 1.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 284/ 128728 | consumed samples: 4544 | consumed tokens: 9306112 | elapsed time per iteration (s): 15.21 | learning rate: 1.489E-06 | global batch size: 16 | lm loss: 9.289896E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 285/ 128728 | consumed samples: 4560 | consumed tokens: 9338880 | elapsed time per iteration (s): 15.21 | learning rate: 1.494E-06 | global batch size: 16 | lm loss: 9.052877E+00 | grad norm: 1.013 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 286/ 128728 | consumed samples: 4576 | consumed tokens: 9371648 | elapsed time per iteration (s): 15.19 | learning rate: 1.499E-06 | global batch size: 16 | lm loss: 9.375488E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 287/ 128728 | consumed samples: 4592 | consumed tokens: 9404416 | elapsed time per iteration (s): 15.21 | learning rate: 1.505E-06 | global batch size: 16 | lm loss: 9.356708E+00 | grad norm: 1.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 288/ 128728 | consumed samples: 4608 | consumed tokens: 9437184 | elapsed time per iteration (s): 15.22 | 
learning rate: 1.510E-06 | global batch size: 16 | lm loss: 9.133143E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 289/ 128728 | consumed samples: 4624 | consumed tokens: 9469952 | elapsed time per iteration (s): 15.26 | learning rate: 1.515E-06 | global batch size: 16 | lm loss: 9.436096E+00 | grad norm: 1.579 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 290/ 128728 | consumed samples: 4640 | consumed tokens: 9502720 | elapsed time per iteration (s): 15.21 | learning rate: 1.520E-06 | global batch size: 16 | lm loss: 9.226528E+00 | grad norm: 1.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 291/ 128728 | consumed samples: 4656 | consumed tokens: 9535488 | elapsed time per iteration (s): 15.19 | learning rate: 1.526E-06 | global batch size: 16 | lm loss: 9.340797E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 292/ 128728 | consumed samples: 4672 | consumed tokens: 9568256 | elapsed time per iteration (s): 15.22 | learning rate: 1.531E-06 | global batch size: 16 | lm loss: 9.186805E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 293/ 128728 | consumed samples: 4688 | consumed tokens: 9601024 | elapsed time per iteration (s): 15.21 | learning rate: 1.536E-06 | global batch size: 16 | lm loss: 9.120500E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 294/ 128728 | consumed 
samples: 4704 | consumed tokens: 9633792 | elapsed time per iteration (s): 15.19 | learning rate: 1.541E-06 | global batch size: 16 | lm loss: 9.039913E+00 | grad norm: 1.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 295/ 128728 | consumed samples: 4720 | consumed tokens: 9666560 | elapsed time per iteration (s): 15.26 | learning rate: 1.547E-06 | global batch size: 16 | lm loss: 9.181991E+00 | grad norm: 1.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 296/ 128728 | consumed samples: 4736 | consumed tokens: 9699328 | elapsed time per iteration (s): 15.23 | learning rate: 1.552E-06 | global batch size: 16 | lm loss: 9.090605E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 297/ 128728 | consumed samples: 4752 | consumed tokens: 9732096 | elapsed time per iteration (s): 15.25 | learning rate: 1.557E-06 | global batch size: 16 | lm loss: 9.270121E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 298/ 128728 | consumed samples: 4768 | consumed tokens: 9764864 | elapsed time per iteration (s): 15.21 | learning rate: 1.562E-06 | global batch size: 16 | lm loss: 9.101935E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 299/ 128728 | consumed samples: 4784 | consumed tokens: 9797632 | elapsed time per iteration (s): 15.21 | learning rate: 1.568E-06 | global batch size: 16 | lm loss: 9.210810E+00 | grad norm: 1.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 300/ 128728 | consumed samples: 4800 | consumed tokens: 9830400 | elapsed time per iteration (s): 15.21 | learning rate: 1.573E-06 | global batch size: 16 | lm loss: 9.187110E+00 | grad norm: 1.465 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 301/ 128728 | consumed samples: 4816 | consumed tokens: 9863168 | elapsed time per iteration (s): 15.22 | learning rate: 1.578E-06 | global batch size: 16 | lm loss: 9.143536E+00 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 302/ 128728 | consumed samples: 4832 | consumed tokens: 9895936 | elapsed time per iteration (s): 15.22 | learning rate: 1.583E-06 | global batch size: 16 | lm loss: 9.160694E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 303/ 128728 | consumed samples: 4848 | consumed tokens: 9928704 | elapsed time per iteration (s): 15.24 | learning rate: 1.589E-06 | global batch size: 16 | lm loss: 9.221185E+00 | grad norm: 1.317 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 304/ 128728 | consumed samples: 4864 | consumed tokens: 9961472 | elapsed time per iteration (s): 15.21 | learning rate: 1.594E-06 | global batch size: 16 | lm loss: 9.189565E+00 | grad norm: 1.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 305/ 128728 | consumed samples: 4880 | consumed tokens: 9994240 | elapsed time per iteration (s): 15.21 | learning rate: 1.599E-06 | global batch size: 16 | lm loss: 9.239432E+00 | grad norm: 1.386 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 306/ 128728 | consumed samples: 4896 | consumed tokens: 10027008 | elapsed time per iteration (s): 15.23 | learning rate: 1.604E-06 | global batch size: 16 | lm loss: 9.193028E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 307/ 128728 | consumed samples: 4912 | consumed tokens: 10059776 | elapsed time per iteration (s): 15.22 | learning rate: 1.610E-06 | global batch size: 16 | lm loss: 9.158922E+00 | grad norm: 1.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 308/ 128728 | consumed samples: 4928 | consumed tokens: 10092544 | elapsed time per iteration (s): 15.21 | learning rate: 1.615E-06 | global batch size: 16 | lm loss: 9.136261E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 309/ 128728 | consumed samples: 4944 | consumed tokens: 10125312 | elapsed time per iteration (s): 15.23 | learning rate: 1.620E-06 | global batch size: 16 | lm loss: 9.243754E+00 | grad norm: 1.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 310/ 128728 | consumed samples: 4960 | consumed tokens: 10158080 | elapsed time per iteration (s): 15.23 | learning rate: 1.625E-06 | global batch size: 16 | lm loss: 9.191011E+00 | grad norm: 1.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 311/ 128728 | consumed samples: 4976 | consumed tokens: 10190848 | elapsed time per iteration (s): 15.21 | learning rate: 1.631E-06 | 
global batch size: 16 | lm loss: 9.023661E+00 | grad norm: 1.505 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 312/ 128728 | consumed samples: 4992 | consumed tokens: 10223616 | elapsed time per iteration (s): 15.26 | learning rate: 1.636E-06 | global batch size: 16 | lm loss: 9.186005E+00 | grad norm: 1.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 313/ 128728 | consumed samples: 5008 | consumed tokens: 10256384 | elapsed time per iteration (s): 15.23 | learning rate: 1.641E-06 | global batch size: 16 | lm loss: 9.018515E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 314/ 128728 | consumed samples: 5024 | consumed tokens: 10289152 | elapsed time per iteration (s): 15.19 | learning rate: 1.646E-06 | global batch size: 16 | lm loss: 8.974466E+00 | grad norm: 1.562 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 315/ 128728 | consumed samples: 5040 | consumed tokens: 10321920 | elapsed time per iteration (s): 15.17 | learning rate: 1.652E-06 | global batch size: 16 | lm loss: 9.060785E+00 | grad norm: 1.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 316/ 128728 | consumed samples: 5056 | consumed tokens: 10354688 | elapsed time per iteration (s): 15.25 | learning rate: 1.657E-06 | global batch size: 16 | lm loss: 9.159584E+00 | grad norm: 1.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 317/ 128728 | consumed samples: 5072 | consumed 
tokens: 10387456 | elapsed time per iteration (s): 15.19 | learning rate: 1.662E-06 | global batch size: 16 | lm loss: 9.113900E+00 | grad norm: 1.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 318/ 128728 | consumed samples: 5088 | consumed tokens: 10420224 | elapsed time per iteration (s): 15.23 | learning rate: 1.667E-06 | global batch size: 16 | lm loss: 9.078951E+00 | grad norm: 1.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 319/ 128728 | consumed samples: 5104 | consumed tokens: 10452992 | elapsed time per iteration (s): 15.21 | learning rate: 1.672E-06 | global batch size: 16 | lm loss: 9.058454E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 320/ 128728 | consumed samples: 5120 | consumed tokens: 10485760 | elapsed time per iteration (s): 15.22 | learning rate: 1.678E-06 | global batch size: 16 | lm loss: 9.104960E+00 | grad norm: 1.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 321/ 128728 | consumed samples: 5136 | consumed tokens: 10518528 | elapsed time per iteration (s): 15.19 | learning rate: 1.683E-06 | global batch size: 16 | lm loss: 8.983455E+00 | grad norm: 1.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 322/ 128728 | consumed samples: 5152 | consumed tokens: 10551296 | elapsed time per iteration (s): 15.22 | learning rate: 1.688E-06 | global batch size: 16 | lm loss: 8.882467E+00 | grad norm: 1.355 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | 
TFLOPs: 8.05 | [default7]: iteration 323/ 128728 | consumed samples: 5168 | consumed tokens: 10584064 | elapsed time per iteration (s): 15.21 | learning rate: 1.693E-06 | global batch size: 16 | lm loss: 8.978757E+00 | grad norm: 2.112 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 324/ 128728 | consumed samples: 5184 | consumed tokens: 10616832 | elapsed time per iteration (s): 15.15 | learning rate: 1.699E-06 | global batch size: 16 | lm loss: 9.070255E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 325/ 128728 | consumed samples: 5200 | consumed tokens: 10649600 | elapsed time per iteration (s): 15.20 | learning rate: 1.704E-06 | global batch size: 16 | lm loss: 9.185911E+00 | grad norm: 1.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 326/ 128728 | consumed samples: 5216 | consumed tokens: 10682368 | elapsed time per iteration (s): 15.23 | learning rate: 1.709E-06 | global batch size: 16 | lm loss: 8.935247E+00 | grad norm: 1.418 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 327/ 128728 | consumed samples: 5232 | consumed tokens: 10715136 | elapsed time per iteration (s): 15.24 | learning rate: 1.714E-06 | global batch size: 16 | lm loss: 8.980277E+00 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 328/ 128728 | consumed samples: 5248 | consumed tokens: 10747904 | elapsed time per iteration (s): 15.23 | learning rate: 1.720E-06 | global batch size: 16 | lm loss: 9.004158E+00 | grad norm: 1.655 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 329/ 128728 | consumed samples: 5264 | consumed tokens: 10780672 | elapsed time per iteration (s): 15.17 | learning rate: 1.725E-06 | global batch size: 16 | lm loss: 9.141132E+00 | grad norm: 1.525 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 330/ 128728 | consumed samples: 5280 | consumed tokens: 10813440 | elapsed time per iteration (s): 15.20 | learning rate: 1.730E-06 | global batch size: 16 | lm loss: 9.098420E+00 | grad norm: 1.606 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 331/ 128728 | consumed samples: 5296 | consumed tokens: 10846208 | elapsed time per iteration (s): 15.21 | learning rate: 1.735E-06 | global batch size: 16 | lm loss: 9.103991E+00 | grad norm: 1.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 332/ 128728 | consumed samples: 5312 | consumed tokens: 10878976 | elapsed time per iteration (s): 15.18 | learning rate: 1.741E-06 | global batch size: 16 | lm loss: 9.196499E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 333/ 128728 | consumed samples: 5328 | consumed tokens: 10911744 | elapsed time per iteration (s): 15.24 | learning rate: 1.746E-06 | global batch size: 16 | lm loss: 8.898166E+00 | grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 334/ 128728 | consumed samples: 5344 | consumed tokens: 10944512 | elapsed time per iteration (s): 15.23 | learning rate: 1.751E-06 | global 
batch size: 16 | lm loss: 8.988365E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 335/ 128728 | consumed samples: 5360 | consumed tokens: 10977280 | elapsed time per iteration (s): 15.25 | learning rate: 1.756E-06 | global batch size: 16 | lm loss: 8.947336E+00 | grad norm: 1.491 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 336/ 128728 | consumed samples: 5376 | consumed tokens: 11010048 | elapsed time per iteration (s): 15.20 | learning rate: 1.762E-06 | global batch size: 16 | lm loss: 8.925644E+00 | grad norm: 2.491 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 337/ 128728 | consumed samples: 5392 | consumed tokens: 11042816 | elapsed time per iteration (s): 15.26 | learning rate: 1.767E-06 | global batch size: 16 | lm loss: 8.995684E+00 | grad norm: 1.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 338/ 128728 | consumed samples: 5408 | consumed tokens: 11075584 | elapsed time per iteration (s): 15.24 | learning rate: 1.772E-06 | global batch size: 16 | lm loss: 8.828646E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 339/ 128728 | consumed samples: 5424 | consumed tokens: 11108352 | elapsed time per iteration (s): 15.23 | learning rate: 1.777E-06 | global batch size: 16 | lm loss: 8.849914E+00 | grad norm: 1.488 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 340/ 128728 | consumed samples: 5440 | consumed tokens: 
11141120 | elapsed time per iteration (s): 15.22 | learning rate: 1.783E-06 | global batch size: 16 | lm loss: 8.669468E+00 | grad norm: 1.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 341/ 128728 | consumed samples: 5456 | consumed tokens: 11173888 | elapsed time per iteration (s): 15.22 | learning rate: 1.788E-06 | global batch size: 16 | lm loss: 8.875322E+00 | grad norm: 1.433 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 342/ 128728 | consumed samples: 5472 | consumed tokens: 11206656 | elapsed time per iteration (s): 15.24 | learning rate: 1.793E-06 | global batch size: 16 | lm loss: 9.113847E+00 | grad norm: 1.601 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 343/ 128728 | consumed samples: 5488 | consumed tokens: 11239424 | elapsed time per iteration (s): 15.25 | learning rate: 1.798E-06 | global batch size: 16 | lm loss: 9.039911E+00 | grad norm: 2.029 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 344/ 128728 | consumed samples: 5504 | consumed tokens: 11272192 | elapsed time per iteration (s): 15.25 | learning rate: 1.804E-06 | global batch size: 16 | lm loss: 8.952249E+00 | grad norm: 1.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 345/ 128728 | consumed samples: 5520 | consumed tokens: 11304960 | elapsed time per iteration (s): 15.19 | learning rate: 1.809E-06 | global batch size: 16 | lm loss: 9.029071E+00 | grad norm: 2.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 
8.07 | [default7]: iteration 346/ 128728 | consumed samples: 5536 | consumed tokens: 11337728 | elapsed time per iteration (s): 15.22 | learning rate: 1.814E-06 | global batch size: 16 | lm loss: 8.957701E+00 | grad norm: 2.596 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 347/ 128728 | consumed samples: 5552 | consumed tokens: 11370496 | elapsed time per iteration (s): 15.19 | learning rate: 1.819E-06 | global batch size: 16 | lm loss: 9.178146E+00 | grad norm: 2.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 348/ 128728 | consumed samples: 5568 | consumed tokens: 11403264 | elapsed time per iteration (s): 15.24 | learning rate: 1.825E-06 | global batch size: 16 | lm loss: 8.739803E+00 | grad norm: 2.014 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 349/ 128728 | consumed samples: 5584 | consumed tokens: 11436032 | elapsed time per iteration (s): 15.21 | learning rate: 1.830E-06 | global batch size: 16 | lm loss: 9.074715E+00 | grad norm: 1.441 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 350/ 128728 | consumed samples: 5600 | consumed tokens: 11468800 | elapsed time per iteration (s): 15.23 | learning rate: 1.835E-06 | global batch size: 16 | lm loss: 8.816961E+00 | grad norm: 2.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 351/ 128728 | consumed samples: 5616 | consumed tokens: 11501568 | elapsed time per iteration (s): 15.21 | learning rate: 1.840E-06 | global batch size: 16 | lm loss: 9.123592E+00 | grad norm: 2.014 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 352/ 128728 | consumed samples: 5632 | consumed tokens: 11534336 | elapsed time per iteration (s): 15.24 | learning rate: 1.845E-06 | global batch size: 16 | lm loss: 9.053972E+00 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 353/ 128728 | consumed samples: 5648 | consumed tokens: 11567104 | elapsed time per iteration (s): 15.23 | learning rate: 1.851E-06 | global batch size: 16 | lm loss: 8.837742E+00 | grad norm: 1.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 354/ 128728 | consumed samples: 5664 | consumed tokens: 11599872 | elapsed time per iteration (s): 15.25 | learning rate: 1.856E-06 | global batch size: 16 | lm loss: 8.724428E+00 | grad norm: 1.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 355/ 128728 | consumed samples: 5680 | consumed tokens: 11632640 | elapsed time per iteration (s): 15.26 | learning rate: 1.861E-06 | global batch size: 16 | lm loss: 8.793618E+00 | grad norm: 1.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 356/ 128728 | consumed samples: 5696 | consumed tokens: 11665408 | elapsed time per iteration (s): 15.22 | learning rate: 1.866E-06 | global batch size: 16 | lm loss: 8.806067E+00 | grad norm: 1.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 357/ 128728 | consumed samples: 5712 | consumed tokens: 11698176 | elapsed time per iteration (s): 15.20 | learning rate: 1.872E-06 | global batch size: 
16 | lm loss: 8.795446E+00 | grad norm: 1.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 358/ 128728 | consumed samples: 5728 | consumed tokens: 11730944 | elapsed time per iteration (s): 15.24 | learning rate: 1.877E-06 | global batch size: 16 | lm loss: 8.763588E+00 | grad norm: 1.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 359/ 128728 | consumed samples: 5744 | consumed tokens: 11763712 | elapsed time per iteration (s): 15.26 | learning rate: 1.882E-06 | global batch size: 16 | lm loss: 8.908950E+00 | grad norm: 1.528 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 360/ 128728 | consumed samples: 5760 | consumed tokens: 11796480 | elapsed time per iteration (s): 15.19 | learning rate: 1.887E-06 | global batch size: 16 | lm loss: 8.781729E+00 | grad norm: 1.996 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 361/ 128728 | consumed samples: 5776 | consumed tokens: 11829248 | elapsed time per iteration (s): 15.19 | learning rate: 1.893E-06 | global batch size: 16 | lm loss: 8.808187E+00 | grad norm: 1.622 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 362/ 128728 | consumed samples: 5792 | consumed tokens: 11862016 | elapsed time per iteration (s): 15.25 | learning rate: 1.898E-06 | global batch size: 16 | lm loss: 8.742043E+00 | grad norm: 1.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 363/ 128728 | consumed samples: 5808 | consumed tokens: 11894784 | 
elapsed time per iteration (s): 15.27 | learning rate: 1.903E-06 | global batch size: 16 | lm loss: 8.903679E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 364/ 128728 | consumed samples: 5824 | consumed tokens: 11927552 | elapsed time per iteration (s): 15.25 | learning rate: 1.908E-06 | global batch size: 16 | lm loss: 8.821105E+00 | grad norm: 1.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 365/ 128728 | consumed samples: 5840 | consumed tokens: 11960320 | elapsed time per iteration (s): 15.24 | learning rate: 1.914E-06 | global batch size: 16 | lm loss: 8.744251E+00 | grad norm: 2.032 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 366/ 128728 | consumed samples: 5856 | consumed tokens: 11993088 | elapsed time per iteration (s): 15.24 | learning rate: 1.919E-06 | global batch size: 16 | lm loss: 8.918768E+00 | grad norm: 1.993 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 367/ 128728 | consumed samples: 5872 | consumed tokens: 12025856 | elapsed time per iteration (s): 15.24 | learning rate: 1.924E-06 | global batch size: 16 | lm loss: 8.736933E+00 | grad norm: 3.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 368/ 128728 | consumed samples: 5888 | consumed tokens: 12058624 | elapsed time per iteration (s): 15.24 | learning rate: 1.929E-06 | global batch size: 16 | lm loss: 8.928401E+00 | grad norm: 3.081 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 369/ 128728 | consumed samples: 5904 | consumed tokens: 12091392 | elapsed time per iteration (s): 15.24 | learning rate: 1.935E-06 | global batch size: 16 | lm loss: 8.997413E+00 | grad norm: 2.562 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 370/ 128728 | consumed samples: 5920 | consumed tokens: 12124160 | elapsed time per iteration (s): 15.18 | learning rate: 1.940E-06 | global batch size: 16 | lm loss: 8.656151E+00 | grad norm: 2.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 371/ 128728 | consumed samples: 5936 | consumed tokens: 12156928 | elapsed time per iteration (s): 15.21 | learning rate: 1.945E-06 | global batch size: 16 | lm loss: 8.794637E+00 | grad norm: 2.540 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 372/ 128728 | consumed samples: 5952 | consumed tokens: 12189696 | elapsed time per iteration (s): 15.24 | learning rate: 1.950E-06 | global batch size: 16 | lm loss: 8.713245E+00 | grad norm: 2.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 373/ 128728 | consumed samples: 5968 | consumed tokens: 12222464 | elapsed time per iteration (s): 15.24 | learning rate: 1.956E-06 | global batch size: 16 | lm loss: 8.920404E+00 | grad norm: 1.386 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 374/ 128728 | consumed samples: 5984 | consumed tokens: 12255232 | elapsed time per iteration (s): 15.24 | learning rate: 1.961E-06 | global batch size: 16 | lm loss: 8.724771E+00 | grad norm: 2.832 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 375/ 128728 | consumed samples: 6000 | consumed tokens: 12288000 | elapsed time per iteration (s): 15.24 | learning rate: 1.966E-06 | global batch size: 16 | lm loss: 8.874722E+00 | grad norm: 2.098 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 376/ 128728 | consumed samples: 6016 | consumed tokens: 12320768 | elapsed time per iteration (s): 15.24 | learning rate: 1.971E-06 | global batch size: 16 | lm loss: 8.530634E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 377/ 128728 | consumed samples: 6032 | consumed tokens: 12353536 | elapsed time per iteration (s): 15.22 | learning rate: 1.977E-06 | global batch size: 16 | lm loss: 8.726177E+00 | grad norm: 2.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 378/ 128728 | consumed samples: 6048 | consumed tokens: 12386304 | elapsed time per iteration (s): 15.24 | learning rate: 1.982E-06 | global batch size: 16 | lm loss: 8.662714E+00 | grad norm: 1.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 379/ 128728 | consumed samples: 6064 | consumed tokens: 12419072 | elapsed time per iteration (s): 15.26 | learning rate: 1.987E-06 | global batch size: 16 | lm loss: 8.682480E+00 | grad norm: 1.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 380/ 128728 | consumed samples: 6080 | consumed tokens: 12451840 | elapsed time per iteration (s): 15.24 | learning rate: 1.992E-06 | global batch size: 
16 | lm loss: 8.867916E+00 | grad norm: 1.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 381/ 128728 | consumed samples: 6096 | consumed tokens: 12484608 | elapsed time per iteration (s): 15.23 | learning rate: 1.998E-06 | global batch size: 16 | lm loss: 8.751642E+00 | grad norm: 2.013 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 382/ 128728 | consumed samples: 6112 | consumed tokens: 12517376 | elapsed time per iteration (s): 15.27 | learning rate: 2.003E-06 | global batch size: 16 | lm loss: 8.746722E+00 | grad norm: 1.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 383/ 128728 | consumed samples: 6128 | consumed tokens: 12550144 | elapsed time per iteration (s): 15.23 | learning rate: 2.008E-06 | global batch size: 16 | lm loss: 8.698657E+00 | grad norm: 2.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 384/ 128728 | consumed samples: 6144 | consumed tokens: 12582912 | elapsed time per iteration (s): 15.27 | learning rate: 2.013E-06 | global batch size: 16 | lm loss: 8.771927E+00 | grad norm: 1.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 385/ 128728 | consumed samples: 6160 | consumed tokens: 12615680 | elapsed time per iteration (s): 15.22 | learning rate: 2.019E-06 | global batch size: 16 | lm loss: 8.916728E+00 | grad norm: 2.030 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 386/ 128728 | consumed samples: 6176 | consumed tokens: 12648448 | 
elapsed time per iteration (s): 15.25 | learning rate: 2.024E-06 | global batch size: 16 | lm loss: 8.761660E+00 | grad norm: 2.348 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 387/ 128728 | consumed samples: 6192 | consumed tokens: 12681216 | elapsed time per iteration (s): 15.26 | learning rate: 2.029E-06 | global batch size: 16 | lm loss: 8.588232E+00 | grad norm: 1.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 388/ 128728 | consumed samples: 6208 | consumed tokens: 12713984 | elapsed time per iteration (s): 15.22 | learning rate: 2.034E-06 | global batch size: 16 | lm loss: 8.639584E+00 | grad norm: 2.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 389/ 128728 | consumed samples: 6224 | consumed tokens: 12746752 | elapsed time per iteration (s): 15.25 | learning rate: 2.039E-06 | global batch size: 16 | lm loss: 8.722241E+00 | grad norm: 2.932 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 390/ 128728 | consumed samples: 6240 | consumed tokens: 12779520 | elapsed time per iteration (s): 15.25 | learning rate: 2.045E-06 | global batch size: 16 | lm loss: 8.805967E+00 | grad norm: 2.410 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 391/ 128728 | consumed samples: 6256 | consumed tokens: 12812288 | elapsed time per iteration (s): 15.25 | learning rate: 2.050E-06 | global batch size: 16 | lm loss: 8.767456E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | 
[default7]: iteration 392/ 128728 | consumed samples: 6272 | consumed tokens: 12845056 | elapsed time per iteration (s): 15.24 | learning rate: 2.055E-06 | global batch size: 16 | lm loss: 8.722268E+00 | grad norm: 2.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 393/ 128728 | consumed samples: 6288 | consumed tokens: 12877824 | elapsed time per iteration (s): 15.22 | learning rate: 2.060E-06 | global batch size: 16 | lm loss: 8.755892E+00 | grad norm: 1.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 394/ 128728 | consumed samples: 6304 | consumed tokens: 12910592 | elapsed time per iteration (s): 15.25 | learning rate: 2.066E-06 | global batch size: 16 | lm loss: 8.785294E+00 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 395/ 128728 | consumed samples: 6320 | consumed tokens: 12943360 | elapsed time per iteration (s): 15.23 | learning rate: 2.071E-06 | global batch size: 16 | lm loss: 8.598062E+00 | grad norm: 1.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 396/ 128728 | consumed samples: 6336 | consumed tokens: 12976128 | elapsed time per iteration (s): 15.17 | learning rate: 2.076E-06 | global batch size: 16 | lm loss: 8.763098E+00 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 397/ 128728 | consumed samples: 6352 | consumed tokens: 13008896 | elapsed time per iteration (s): 15.23 | learning rate: 2.081E-06 | global batch size: 16 | lm loss: 8.675168E+00 | grad norm: 1.460 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 398/ 128728 | consumed samples: 6368 | consumed tokens: 13041664 | elapsed time per iteration (s): 15.23 | learning rate: 2.087E-06 | global batch size: 16 | lm loss: 8.532794E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 399/ 128728 | consumed samples: 6384 | consumed tokens: 13074432 | elapsed time per iteration (s): 15.22 | learning rate: 2.092E-06 | global batch size: 16 | lm loss: 8.650246E+00 | grad norm: 1.473 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 400/ 128728 | consumed samples: 6400 | consumed tokens: 13107200 | elapsed time per iteration (s): 15.22 | learning rate: 2.097E-06 | global batch size: 16 | lm loss: 8.503979E+00 | grad norm: 1.464 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 401/ 128728 | consumed samples: 6416 | consumed tokens: 13139968 | elapsed time per iteration (s): 15.22 | learning rate: 2.102E-06 | global batch size: 16 | lm loss: 8.529534E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 402/ 128728 | consumed samples: 6432 | consumed tokens: 13172736 | elapsed time per iteration (s): 15.24 | learning rate: 2.108E-06 | global batch size: 16 | lm loss: 8.620544E+00 | grad norm: 1.908 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 403/ 128728 | consumed samples: 6448 | consumed tokens: 13205504 | elapsed time per iteration (s): 15.26 | learning rate: 2.113E-06 | global batch size: 
16 | lm loss: 8.570610E+00 | grad norm: 1.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 404/ 128728 | consumed samples: 6464 | consumed tokens: 13238272 | elapsed time per iteration (s): 15.25 | learning rate: 2.118E-06 | global batch size: 16 | lm loss: 8.559856E+00 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 405/ 128728 | consumed samples: 6480 | consumed tokens: 13271040 | elapsed time per iteration (s): 15.28 | learning rate: 2.123E-06 | global batch size: 16 | lm loss: 8.603176E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 406/ 128728 | consumed samples: 6496 | consumed tokens: 13303808 | elapsed time per iteration (s): 15.18 | learning rate: 2.129E-06 | global batch size: 16 | lm loss: 8.468877E+00 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 407/ 128728 | consumed samples: 6512 | consumed tokens: 13336576 | elapsed time per iteration (s): 15.25 | learning rate: 2.134E-06 | global batch size: 16 | lm loss: 8.496984E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 408/ 128728 | consumed samples: 6528 | consumed tokens: 13369344 | elapsed time per iteration (s): 15.18 | learning rate: 2.139E-06 | global batch size: 16 | lm loss: 8.568752E+00 | grad norm: 1.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 409/ 128728 | consumed samples: 6544 | consumed tokens: 13402112 | 
elapsed time per iteration (s): 15.17 | learning rate: 2.144E-06 | global batch size: 16 | lm loss: 8.504786E+00 | grad norm: 2.098 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 410/ 128728 | consumed samples: 6560 | consumed tokens: 13434880 | elapsed time per iteration (s): 15.23 | learning rate: 2.150E-06 | global batch size: 16 | lm loss: 8.729224E+00 | grad norm: 1.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 411/ 128728 | consumed samples: 6576 | consumed tokens: 13467648 | elapsed time per iteration (s): 15.18 | learning rate: 2.155E-06 | global batch size: 16 | lm loss: 8.696260E+00 | grad norm: 3.105 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 412/ 128728 | consumed samples: 6592 | consumed tokens: 13500416 | elapsed time per iteration (s): 15.24 | learning rate: 2.160E-06 | global batch size: 16 | lm loss: 8.525265E+00 | grad norm: 1.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 413/ 128728 | consumed samples: 6608 | consumed tokens: 13533184 | elapsed time per iteration (s): 15.22 | learning rate: 2.165E-06 | global batch size: 16 | lm loss: 8.653839E+00 | grad norm: 3.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 414/ 128728 | consumed samples: 6624 | consumed tokens: 13565952 | elapsed time per iteration (s): 15.24 | learning rate: 2.171E-06 | global batch size: 16 | lm loss: 8.959422E+00 | grad norm: 4.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 415/ 128728 | consumed samples: 6640 | consumed tokens: 13598720 | elapsed time per iteration (s): 15.23 | learning rate: 2.176E-06 | global batch size: 16 | lm loss: 8.594271E+00 | grad norm: 1.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 416/ 128728 | consumed samples: 6656 | consumed tokens: 13631488 | elapsed time per iteration (s): 15.21 | learning rate: 2.181E-06 | global batch size: 16 | lm loss: 8.770068E+00 | grad norm: 1.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 417/ 128728 | consumed samples: 6672 | consumed tokens: 13664256 | elapsed time per iteration (s): 15.21 | learning rate: 2.186E-06 | global batch size: 16 | lm loss: 8.684436E+00 | grad norm: 2.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 418/ 128728 | consumed samples: 6688 | consumed tokens: 13697024 | elapsed time per iteration (s): 15.23 | learning rate: 2.192E-06 | global batch size: 16 | lm loss: 8.469204E+00 | grad norm: 1.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 419/ 128728 | consumed samples: 6704 | consumed tokens: 13729792 | elapsed time per iteration (s): 15.19 | learning rate: 2.197E-06 | global batch size: 16 | lm loss: 8.532163E+00 | grad norm: 2.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 420/ 128728 | consumed samples: 6720 | consumed tokens: 13762560 | elapsed time per iteration (s): 15.25 | learning rate: 2.202E-06 | global batch size: 16 | lm loss: 8.762425E+00 | grad norm: 2.761 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 421/ 128728 | consumed samples: 6736 | consumed tokens: 13795328 | elapsed time per iteration (s): 15.20 | learning rate: 2.207E-06 | global batch size: 16 | lm loss: 8.625541E+00 | grad norm: 1.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 422/ 128728 | consumed samples: 6752 | consumed tokens: 13828096 | elapsed time per iteration (s): 15.20 | learning rate: 2.213E-06 | global batch size: 16 | lm loss: 8.546190E+00 | grad norm: 2.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 423/ 128728 | consumed samples: 6768 | consumed tokens: 13860864 | elapsed time per iteration (s): 15.24 | learning rate: 2.218E-06 | global batch size: 16 | lm loss: 8.478785E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 424/ 128728 | consumed samples: 6784 | consumed tokens: 13893632 | elapsed time per iteration (s): 15.23 | learning rate: 2.223E-06 | global batch size: 16 | lm loss: 8.501416E+00 | grad norm: 2.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 425/ 128728 | consumed samples: 6800 | consumed tokens: 13926400 | elapsed time per iteration (s): 15.25 | learning rate: 2.228E-06 | global batch size: 16 | lm loss: 8.284233E+00 | grad norm: 2.423 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 426/ 128728 | consumed samples: 6816 | consumed tokens: 13959168 | elapsed time per iteration (s): 15.20 | learning rate: 2.233E-06 | global batch size: 
16 | lm loss: 8.605833E+00 | grad norm: 1.574 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 427/ 128728 | consumed samples: 6832 | consumed tokens: 13991936 | elapsed time per iteration (s): 15.22 | learning rate: 2.239E-06 | global batch size: 16 | lm loss: 8.659263E+00 | grad norm: 2.362 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 428/ 128728 | consumed samples: 6848 | consumed tokens: 14024704 | elapsed time per iteration (s): 15.26 | learning rate: 2.244E-06 | global batch size: 16 | lm loss: 8.621931E+00 | grad norm: 1.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 429/ 128728 | consumed samples: 6864 | consumed tokens: 14057472 | elapsed time per iteration (s): 15.22 | learning rate: 2.249E-06 | global batch size: 16 | lm loss: 8.517220E+00 | grad norm: 2.533 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 430/ 128728 | consumed samples: 6880 | consumed tokens: 14090240 | elapsed time per iteration (s): 15.17 | learning rate: 2.254E-06 | global batch size: 16 | lm loss: 8.515087E+00 | grad norm: 2.479 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 431/ 128728 | consumed samples: 6896 | consumed tokens: 14123008 | elapsed time per iteration (s): 15.24 | learning rate: 2.260E-06 | global batch size: 16 | lm loss: 8.327739E+00 | grad norm: 1.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 432/ 128728 | consumed samples: 6912 | consumed tokens: 14155776 | 
elapsed time per iteration (s): 15.20 | learning rate: 2.265E-06 | global batch size: 16 | lm loss: 8.415800E+00 | grad norm: 1.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 433/ 128728 | consumed samples: 6928 | consumed tokens: 14188544 | elapsed time per iteration (s): 15.19 | learning rate: 2.270E-06 | global batch size: 16 | lm loss: 8.553007E+00 | grad norm: 2.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 434/ 128728 | consumed samples: 6944 | consumed tokens: 14221312 | elapsed time per iteration (s): 15.22 | learning rate: 2.275E-06 | global batch size: 16 | lm loss: 8.405775E+00 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 435/ 128728 | consumed samples: 6960 | consumed tokens: 14254080 | elapsed time per iteration (s): 15.24 | learning rate: 2.281E-06 | global batch size: 16 | lm loss: 8.622299E+00 | grad norm: 2.424 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 436/ 128728 | consumed samples: 6976 | consumed tokens: 14286848 | elapsed time per iteration (s): 15.24 | learning rate: 2.286E-06 | global batch size: 16 | lm loss: 8.557680E+00 | grad norm: 1.607 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 437/ 128728 | consumed samples: 6992 | consumed tokens: 14319616 | elapsed time per iteration (s): 15.22 | learning rate: 2.291E-06 | global batch size: 16 | lm loss: 8.482496E+00 | grad norm: 1.482 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | 
[default7]: iteration 438/ 128728 | consumed samples: 7008 | consumed tokens: 14352384 | elapsed time per iteration (s): 15.18 | learning rate: 2.296E-06 | global batch size: 16 | lm loss: 8.464623E+00 | grad norm: 1.392 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 439/ 128728 | consumed samples: 7024 | consumed tokens: 14385152 | elapsed time per iteration (s): 15.23 | learning rate: 2.302E-06 | global batch size: 16 | lm loss: 8.540413E+00 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 440/ 128728 | consumed samples: 7040 | consumed tokens: 14417920 | elapsed time per iteration (s): 15.21 | learning rate: 2.307E-06 | global batch size: 16 | lm loss: 8.238720E+00 | grad norm: 1.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 441/ 128728 | consumed samples: 7056 | consumed tokens: 14450688 | elapsed time per iteration (s): 15.23 | learning rate: 2.312E-06 | global batch size: 16 | lm loss: 8.452703E+00 | grad norm: 1.373 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 442/ 128728 | consumed samples: 7072 | consumed tokens: 14483456 | elapsed time per iteration (s): 15.21 | learning rate: 2.317E-06 | global batch size: 16 | lm loss: 8.485899E+00 | grad norm: 1.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 443/ 128728 | consumed samples: 7088 | consumed tokens: 14516224 | elapsed time per iteration (s): 15.23 | learning rate: 2.323E-06 | global batch size: 16 | lm loss: 8.319616E+00 | grad norm: 1.722 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 444/ 128728 | consumed samples: 7104 | consumed tokens: 14548992 | elapsed time per iteration (s): 15.19 | learning rate: 2.328E-06 | global batch size: 16 | lm loss: 8.515532E+00 | grad norm: 2.338 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 445/ 128728 | consumed samples: 7120 | consumed tokens: 14581760 | elapsed time per iteration (s): 15.26 | learning rate: 2.333E-06 | global batch size: 16 | lm loss: 8.538868E+00 | grad norm: 1.501 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 446/ 128728 | consumed samples: 7136 | consumed tokens: 14614528 | elapsed time per iteration (s): 15.24 | learning rate: 2.338E-06 | global batch size: 16 | lm loss: 8.450447E+00 | grad norm: 2.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 447/ 128728 | consumed samples: 7152 | consumed tokens: 14647296 | elapsed time per iteration (s): 15.21 | learning rate: 2.344E-06 | global batch size: 16 | lm loss: 8.434704E+00 | grad norm: 1.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 448/ 128728 | consumed samples: 7168 | consumed tokens: 14680064 | elapsed time per iteration (s): 15.23 | learning rate: 2.349E-06 | global batch size: 16 | lm loss: 8.462121E+00 | grad norm: 2.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 449/ 128728 | consumed samples: 7184 | consumed tokens: 14712832 | elapsed time per iteration (s): 15.28 | learning rate: 2.354E-06 | global batch size: 
16 | lm loss: 8.375209E+00 | grad norm: 1.431 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 450/ 128728 | consumed samples: 7200 | consumed tokens: 14745600 | elapsed time per iteration (s): 15.23 | learning rate: 2.359E-06 | global batch size: 16 | lm loss: 8.421515E+00 | grad norm: 3.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 451/ 128728 | consumed samples: 7216 | consumed tokens: 14778368 | elapsed time per iteration (s): 15.25 | learning rate: 2.365E-06 | global batch size: 16 | lm loss: 8.501270E+00 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 452/ 128728 | consumed samples: 7232 | consumed tokens: 14811136 | elapsed time per iteration (s): 15.21 | learning rate: 2.370E-06 | global batch size: 16 | lm loss: 8.473967E+00 | grad norm: 2.406 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 453/ 128728 | consumed samples: 7248 | consumed tokens: 14843904 | elapsed time per iteration (s): 15.22 | learning rate: 2.375E-06 | global batch size: 16 | lm loss: 8.457233E+00 | grad norm: 2.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 454/ 128728 | consumed samples: 7264 | consumed tokens: 14876672 | elapsed time per iteration (s): 15.20 | learning rate: 2.380E-06 | global batch size: 16 | lm loss: 8.371508E+00 | grad norm: 2.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 455/ 128728 | consumed samples: 7280 | consumed tokens: 14909440 | 
elapsed time per iteration (s): 15.23 | learning rate: 2.386E-06 | global batch size: 16 | lm loss: 8.326353E+00 | grad norm: 2.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 456/ 128728 | consumed samples: 7296 | consumed tokens: 14942208 | elapsed time per iteration (s): 15.22 | learning rate: 2.391E-06 | global batch size: 16 | lm loss: 8.511120E+00 | grad norm: 1.501 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 457/ 128728 | consumed samples: 7312 | consumed tokens: 14974976 | elapsed time per iteration (s): 15.15 | learning rate: 2.396E-06 | global batch size: 16 | lm loss: 8.472582E+00 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 458/ 128728 | consumed samples: 7328 | consumed tokens: 15007744 | elapsed time per iteration (s): 15.22 | learning rate: 2.401E-06 | global batch size: 16 | lm loss: 8.273072E+00 | grad norm: 7.604 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 459/ 128728 | consumed samples: 7344 | consumed tokens: 15040512 | elapsed time per iteration (s): 15.22 | learning rate: 2.406E-06 | global batch size: 16 | lm loss: 8.573572E+00 | grad norm: 3.482 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 460/ 128728 | consumed samples: 7360 | consumed tokens: 15073280 | elapsed time per iteration (s): 15.22 | learning rate: 2.412E-06 | global batch size: 16 | lm loss: 8.714581E+00 | grad norm: 2.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | 
[default7]: iteration 461/ 128728 | consumed samples: 7376 | consumed tokens: 15106048 | elapsed time per iteration (s): 15.22 | learning rate: 2.417E-06 | global batch size: 16 | lm loss: 8.068087E+00 | grad norm: 2.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 462/ 128728 | consumed samples: 7392 | consumed tokens: 15138816 | elapsed time per iteration (s): 15.24 | learning rate: 2.422E-06 | global batch size: 16 | lm loss: 8.380728E+00 | grad norm: 3.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 463/ 128728 | consumed samples: 7408 | consumed tokens: 15171584 | elapsed time per iteration (s): 15.24 | learning rate: 2.427E-06 | global batch size: 16 | lm loss: 8.633892E+00 | grad norm: 1.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 464/ 128728 | consumed samples: 7424 | consumed tokens: 15204352 | elapsed time per iteration (s): 15.24 | learning rate: 2.433E-06 | global batch size: 16 | lm loss: 8.328359E+00 | grad norm: 2.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 465/ 128728 | consumed samples: 7440 | consumed tokens: 15237120 | elapsed time per iteration (s): 15.23 | learning rate: 2.438E-06 | global batch size: 16 | lm loss: 8.553513E+00 | grad norm: 2.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 466/ 128728 | consumed samples: 7456 | consumed tokens: 15269888 | elapsed time per iteration (s): 15.24 | learning rate: 2.443E-06 | global batch size: 16 | lm loss: 8.325161E+00 | grad norm: 1.684 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 467/ 128728 | consumed samples: 7472 | consumed tokens: 15302656 | elapsed time per iteration (s): 15.25 | learning rate: 2.448E-06 | global batch size: 16 | lm loss: 8.422958E+00 | grad norm: 2.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 468/ 128728 | consumed samples: 7488 | consumed tokens: 15335424 | elapsed time per iteration (s): 15.24 | learning rate: 2.454E-06 | global batch size: 16 | lm loss: 8.435691E+00 | grad norm: 1.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 469/ 128728 | consumed samples: 7504 | consumed tokens: 15368192 | elapsed time per iteration (s): 15.25 | learning rate: 2.459E-06 | global batch size: 16 | lm loss: 8.224545E+00 | grad norm: 1.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 470/ 128728 | consumed samples: 7520 | consumed tokens: 15400960 | elapsed time per iteration (s): 15.23 | learning rate: 2.464E-06 | global batch size: 16 | lm loss: 8.631124E+00 | grad norm: 2.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 471/ 128728 | consumed samples: 7536 | consumed tokens: 15433728 | elapsed time per iteration (s): 15.23 | learning rate: 2.469E-06 | global batch size: 16 | lm loss: 8.445702E+00 | grad norm: 2.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 472/ 128728 | consumed samples: 7552 | consumed tokens: 15466496 | elapsed time per iteration (s): 15.24 | learning rate: 2.475E-06 | global batch size: 
16 | lm loss: 8.381889E+00 | grad norm: 2.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 473/ 128728 | consumed samples: 7568 | consumed tokens: 15499264 | elapsed time per iteration (s): 15.21 | learning rate: 2.480E-06 | global batch size: 16 | lm loss: 8.303854E+00 | grad norm: 1.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 474/ 128728 | consumed samples: 7584 | consumed tokens: 15532032 | elapsed time per iteration (s): 15.24 | learning rate: 2.485E-06 | global batch size: 16 | lm loss: 8.326303E+00 | grad norm: 2.465 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 475/ 128728 | consumed samples: 7600 | consumed tokens: 15564800 | elapsed time per iteration (s): 15.18 | learning rate: 2.490E-06 | global batch size: 16 | lm loss: 8.428562E+00 | grad norm: 2.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 476/ 128728 | consumed samples: 7616 | consumed tokens: 15597568 | elapsed time per iteration (s): 15.19 | learning rate: 2.496E-06 | global batch size: 16 | lm loss: 8.343838E+00 | grad norm: 1.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 477/ 128728 | consumed samples: 7632 | consumed tokens: 15630336 | elapsed time per iteration (s): 15.23 | learning rate: 2.501E-06 | global batch size: 16 | lm loss: 8.380249E+00 | grad norm: 1.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 478/ 128728 | consumed samples: 7648 | consumed tokens: 15663104 | 
elapsed time per iteration (s): 15.25 | learning rate: 2.506E-06 | global batch size: 16 | lm loss: 8.442167E+00 | grad norm: 1.867 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 479/ 128728 | consumed samples: 7664 | consumed tokens: 15695872 | elapsed time per iteration (s): 15.26 | learning rate: 2.511E-06 | global batch size: 16 | lm loss: 8.244312E+00 | grad norm: 2.274 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 480/ 128728 | consumed samples: 7680 | consumed tokens: 15728640 | elapsed time per iteration (s): 15.22 | learning rate: 2.517E-06 | global batch size: 16 | lm loss: 8.509534E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 481/ 128728 | consumed samples: 7696 | consumed tokens: 15761408 | elapsed time per iteration (s): 15.26 | learning rate: 2.522E-06 | global batch size: 16 | lm loss: 8.340829E+00 | grad norm: 2.525 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 482/ 128728 | consumed samples: 7712 | consumed tokens: 15794176 | elapsed time per iteration (s): 15.23 | learning rate: 2.527E-06 | global batch size: 16 | lm loss: 8.174219E+00 | grad norm: 1.454 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 483/ 128728 | consumed samples: 7728 | consumed tokens: 15826944 | elapsed time per iteration (s): 15.23 | learning rate: 2.532E-06 | global batch size: 16 | lm loss: 8.252996E+00 | grad norm: 2.464 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 484/ 128728 | consumed samples: 7744 | consumed tokens: 15859712 | elapsed time per iteration (s): 15.25 | learning rate: 2.538E-06 | global batch size: 16 | lm loss: 8.682319E+00 | grad norm: 2.568 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 485/ 128728 | consumed samples: 7760 | consumed tokens: 15892480 | elapsed time per iteration (s): 15.23 | learning rate: 2.543E-06 | global batch size: 16 | lm loss: 8.436552E+00 | grad norm: 1.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 486/ 128728 | consumed samples: 7776 | consumed tokens: 15925248 | elapsed time per iteration (s): 15.22 | learning rate: 2.548E-06 | global batch size: 16 | lm loss: 8.348639E+00 | grad norm: 2.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 487/ 128728 | consumed samples: 7792 | consumed tokens: 15958016 | elapsed time per iteration (s): 15.25 | learning rate: 2.553E-06 | global batch size: 16 | lm loss: 8.289967E+00 | grad norm: 1.481 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 488/ 128728 | consumed samples: 7808 | consumed tokens: 15990784 | elapsed time per iteration (s): 15.24 | learning rate: 2.559E-06 | global batch size: 16 | lm loss: 8.350722E+00 | grad norm: 1.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 489/ 128728 | consumed samples: 7824 | consumed tokens: 16023552 | elapsed time per iteration (s): 15.24 | learning rate: 2.564E-06 | global batch size: 16 | lm loss: 8.134272E+00 | grad norm: 1.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 490/ 128728 | consumed samples: 7840 | consumed tokens: 16056320 | elapsed time per iteration (s): 15.21 | learning rate: 2.569E-06 | global batch size: 16 | lm loss: 8.318563E+00 | grad norm: 1.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 491/ 128728 | consumed samples: 7856 | consumed tokens: 16089088 | elapsed time per iteration (s): 15.23 | learning rate: 2.574E-06 | global batch size: 16 | lm loss: 8.154824E+00 | grad norm: 1.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 492/ 128728 | consumed samples: 7872 | consumed tokens: 16121856 | elapsed time per iteration (s): 15.19 | learning rate: 2.580E-06 | global batch size: 16 | lm loss: 8.418050E+00 | grad norm: 1.445 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 493/ 128728 | consumed samples: 7888 | consumed tokens: 16154624 | elapsed time per iteration (s): 15.20 | learning rate: 2.585E-06 | global batch size: 16 | lm loss: 8.386696E+00 | grad norm: 2.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 494/ 128728 | consumed samples: 7904 | consumed tokens: 16187392 | elapsed time per iteration (s): 15.21 | learning rate: 2.590E-06 | global batch size: 16 | lm loss: 8.342263E+00 | grad norm: 1.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 495/ 128728 | consumed samples: 7920 | consumed tokens: 16220160 | elapsed time per iteration (s): 15.23 | learning rate: 2.595E-06 | global batch size: 
16 | lm loss: 8.309517E+00 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 496/ 128728 | consumed samples: 7936 | consumed tokens: 16252928 | elapsed time per iteration (s): 15.26 | learning rate: 2.600E-06 | global batch size: 16 | lm loss: 8.248186E+00 | grad norm: 1.297 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 497/ 128728 | consumed samples: 7952 | consumed tokens: 16285696 | elapsed time per iteration (s): 15.23 | learning rate: 2.606E-06 | global batch size: 16 | lm loss: 8.194453E+00 | grad norm: 1.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 498/ 128728 | consumed samples: 7968 | consumed tokens: 16318464 | elapsed time per iteration (s): 15.25 | learning rate: 2.611E-06 | global batch size: 16 | lm loss: 8.389359E+00 | grad norm: 2.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 499/ 128728 | consumed samples: 7984 | consumed tokens: 16351232 | elapsed time per iteration (s): 15.22 | learning rate: 2.616E-06 | global batch size: 16 | lm loss: 8.140213E+00 | grad norm: 1.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 500/ 128728 | consumed samples: 8000 | consumed tokens: 16384000 | elapsed time per iteration (s): 15.26 | learning rate: 2.621E-06 | global batch size: 16 | lm loss: 8.574575E+00 | grad norm: 1.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default0]:saving checkpoint at iteration 500 to 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 08:00:44,912] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/mp_rank_00_model_states.pt [default1]:[2022-03-03 08:00:45,317] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/mp_rank_01_model_states.pt [default5]:[2022-03-03 08:01:22,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default1]:[2022-03-03 08:01:25,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default1]:[2022-03-03 08:01:25,803] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default7]:[2022-03-03 08:01:25,871] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default6]:[2022-03-03 08:01:27,645] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default0]:[2022-03-03 08:01:27,898] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default7]:[2022-03-03 08:01:28,198] [INFO] [engine.py:3077:_save_zero_checkpoint] 
bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default0]:[2022-03-03 08:01:28,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default2]:[2022-03-03 08:01:28,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default4]:[2022-03-03 08:01:28,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default2]:[2022-03-03 08:01:28,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default5]:[2022-03-03 08:01:28,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default3]:[2022-03-03 08:01:28,483] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default1]:[2022-03-03 08:01:28,611] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default1]:[2022-03-03 08:01:28,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default3]:[2022-03-03 08:01:28,866] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default7]:[2022-03-03 08:01:29,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default4]:[2022-03-03 08:01:29,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default2]:[2022-03-03 08:01:29,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default5]:[2022-03-03 08:01:29,220] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default5]:[2022-03-03 08:01:29,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default0]:[2022-03-03 08:01:29,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default5]:[2022-03-03 08:01:29,424] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default7]:[2022-03-03 08:01:29,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default1]:[2022-03-03 08:01:29,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default6]:[2022-03-03 08:01:29,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default6]:[2022-03-03 08:01:29,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default4]:[2022-03-03 08:01:30,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default6]:[2022-03-03 08:01:30,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default3]:[2022-03-03 08:01:30,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default4]:[2022-03-03 08:01:30,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default7]:[2022-03-03 08:01:31,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default3]:[2022-03-03 08:01:31,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default6]:[2022-03-03 08:01:31,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default1]:[2022-03-03 08:01:31,296] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default7]:[2022-03-03 08:01:31,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default0]:[2022-03-03 08:01:32,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default3]:[2022-03-03 08:01:32,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default6]:[2022-03-03 08:01:32,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default4]:[2022-03-03 08:01:32,460] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default1]:[2022-03-03 08:01:32,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default7]:[2022-03-03 08:01:32,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default5]:[2022-03-03 08:01:32,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default2]:[2022-03-03 08:01:32,451] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default1]:[2022-03-03 08:01:32,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default0]:[2022-03-03 08:01:32,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default2]:[2022-03-03 08:01:32,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default3]:[2022-03-03 08:01:32,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default4]:[2022-03-03 08:01:32,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default7]:[2022-03-03 08:01:32,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default7]:[2022-03-03 08:01:32,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default4]:[2022-03-03 08:01:32,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default4]:[2022-03-03 08:01:32,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default4]:[2022-03-03 08:01:32,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default4]:[2022-03-03 08:01:32,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default0]:[2022-03-03 08:01:32,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default6]:[2022-03-03 08:01:32,705] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default3]:[2022-03-03 08:01:32,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default2]:[2022-03-03 08:01:32,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default5]:[2022-03-03 08:01:32,730] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default4]:[2022-03-03 08:01:32,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default0]:[2022-03-03 08:01:32,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default0]:[2022-03-03 08:01:32,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default6]:[2022-03-03 08:01:32,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default3]:[2022-03-03 08:01:32,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default6]:[2022-03-03 08:01:32,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default2]:[2022-03-03 08:01:32,709] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default3]:[2022-03-03 08:01:32,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default0]:[2022-03-03 08:01:32,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default0]:[2022-03-03 08:01:32,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default2]:[2022-03-03 08:01:32,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default0]:[2022-03-03 08:01:32,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default5]:[2022-03-03 08:01:32,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default6]:[2022-03-03 08:01:32,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default6]:[2022-03-03 08:01:32,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default2]:[2022-03-03 08:01:32,807] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default3]:[2022-03-03 08:01:32,842] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default0]:[2022-03-03 08:01:32,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default1]:[2022-03-03 08:01:32,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default5]:[2022-03-03 08:01:32,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default4]:[2022-03-03 08:01:32,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default7]:[2022-03-03 08:01:32,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default3]:[2022-03-03 08:01:32,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default1]:[2022-03-03 08:01:32,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default5]:[2022-03-03 08:01:32,938] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default5]:[2022-03-03 08:01:32,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default6]:[2022-03-03 08:01:32,924] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default1]:[2022-03-03 08:01:33,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default7]:[2022-03-03 08:01:32,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default2]:[2022-03-03 08:01:32,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default3]:[2022-03-03 08:01:33,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default0]:[2022-03-03 08:01:33,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default0]:[2022-03-03 08:01:33,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default2]:[2022-03-03 08:01:33,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default0]:[2022-03-03 08:01:33,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default0]:[2022-03-03 08:01:33,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default2]:[2022-03-03 08:01:33,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default4]:[2022-03-03 08:01:33,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default3]:[2022-03-03 08:01:33,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default5]:[2022-03-03 08:01:33,443] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default3]:[2022-03-03 08:01:33,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default4]:[2022-03-03 08:01:33,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default3]:[2022-03-03 08:01:33,377] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default4]:[2022-03-03 08:01:33,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default2]:[2022-03-03 08:01:33,420] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default3]:[2022-03-03 08:01:33,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default1]:[2022-03-03 08:01:33,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default5]:[2022-03-03 08:01:33,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default1]:[2022-03-03 08:01:33,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default0]:[2022-03-03 08:01:33,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default7]:[2022-03-03 08:01:33,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default4]:[2022-03-03 08:01:33,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default7]:[2022-03-03 08:01:33,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default1]:[2022-03-03 08:01:33,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default1]:[2022-03-03 08:01:33,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default6]:[2022-03-03 08:01:33,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default6]:[2022-03-03 08:01:33,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default3]:[2022-03-03 08:01:33,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default7]:[2022-03-03 08:01:33,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default6]:[2022-03-03 08:01:33,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default2]:[2022-03-03 08:01:33,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default3]:[2022-03-03 08:01:33,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default2]:[2022-03-03 08:01:33,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default2]:[2022-03-03 08:01:33,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default1]:[2022-03-03 08:01:33,794] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default0]:[2022-03-03 08:01:33,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default6]:[2022-03-03 08:01:33,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default2]:[2022-03-03 08:01:33,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default2]:[2022-03-03 08:01:33,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default1]:[2022-03-03 08:01:33,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default5]:[2022-03-03 08:01:33,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default3]:[2022-03-03 08:01:33,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default4]:[2022-03-03 08:01:33,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default6]:[2022-03-03 08:01:33,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default3]:[2022-03-03 08:01:33,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default3]:[2022-03-03 08:01:33,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default1]:[2022-03-03 08:01:33,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default5]:[2022-03-03 08:01:34,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default6]:[2022-03-03 08:01:34,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default3]:[2022-03-03 08:01:34,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default2]:[2022-03-03 08:01:34,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default1]:[2022-03-03 08:01:34,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default0]:[2022-03-03 08:01:34,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default4]:[2022-03-03 08:01:34,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default3]:[2022-03-03 08:01:34,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default5]:[2022-03-03 08:01:34,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default1]:[2022-03-03 08:01:34,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default7]:[2022-03-03 08:01:34,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default6]:[2022-03-03 08:01:34,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default7]:[2022-03-03 08:01:34,202] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default3]:[2022-03-03 08:01:34,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default1]:[2022-03-03 08:01:34,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default5]:[2022-03-03 08:01:34,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default7]:[2022-03-03 08:01:34,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default4]:[2022-03-03 08:01:34,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default1]:[2022-03-03 08:01:34,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default2]:[2022-03-03 08:01:34,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default0]:[2022-03-03 08:01:34,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default7]:[2022-03-03 08:01:34,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default0]:[2022-03-03 08:01:34,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default0]:[2022-03-03 08:01:34,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default3]:[2022-03-03 08:01:34,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default1]:[2022-03-03 08:01:34,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default5]:[2022-03-03 08:01:34,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default5]:[2022-03-03 08:01:34,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default4]:[2022-03-03 08:01:34,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default5]:[2022-03-03 08:01:34,505] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default3]:[2022-03-03 08:01:34,486] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default7]:[2022-03-03 08:01:34,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default7]:[2022-03-03 08:01:34,562] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default2]:[2022-03-03 08:01:34,591] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default1]:[2022-03-03 08:01:34,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default6]:[2022-03-03 08:01:34,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default0]:[2022-03-03 08:01:34,601] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default0]:[2022-03-03 08:01:34,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default5]:[2022-03-03 08:01:34,643] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default2]:[2022-03-03 08:01:34,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default2]:[2022-03-03 08:01:34,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default0]:[2022-03-03 08:01:34,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default4]:[2022-03-03 08:01:34,700] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default2]:[2022-03-03 08:01:34,715] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default3]:[2022-03-03 08:01:34,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default7]:[2022-03-03 08:01:34,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default5]:[2022-03-03 08:01:34,805] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default6]:[2022-03-03 08:01:34,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default7]:[2022-03-03 08:01:34,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default7]:[2022-03-03 08:01:34,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default5]:[2022-03-03 08:01:34,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default3]:[2022-03-03 08:01:34,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default6]:[2022-03-03 08:01:34,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default0]:[2022-03-03 08:01:34,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default4]:[2022-03-03 08:01:34,913] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default5]:[2022-03-03 08:01:34,960] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default5]:[2022-03-03 08:01:34,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default1]:[2022-03-03 08:01:34,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default1]:[2022-03-03 08:01:34,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default5]:[2022-03-03 08:01:34,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default4]:[2022-03-03 08:01:34,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default1]:[2022-03-03 08:01:34,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default6]:[2022-03-03 08:01:35,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default5]:[2022-03-03 08:01:35,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default4]:[2022-03-03 08:01:35,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default1]:[2022-03-03 08:01:35,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default5]:[2022-03-03 08:01:35,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default1]:[2022-03-03 08:01:35,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default0]:[2022-03-03 08:01:35,126] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default3]:[2022-03-03 08:01:35,111] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default2]:[2022-03-03 08:01:35,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default0]:[2022-03-03 08:01:35,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default0]:[2022-03-03 08:01:35,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default5]:[2022-03-03 08:01:35,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default0]:[2022-03-03 08:01:35,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default2]:[2022-03-03 08:01:35,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default3]:[2022-03-03 08:01:35,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default6]:[2022-03-03 08:01:35,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default6]:[2022-03-03 08:01:35,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default1]:[2022-03-03 08:01:35,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default2]:[2022-03-03 08:01:35,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default7]:[2022-03-03 08:01:35,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default4]:[2022-03-03 08:01:35,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default1]:[2022-03-03 08:01:35,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default6]:[2022-03-03 08:01:35,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default1]:[2022-03-03 08:01:35,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default3]:[2022-03-03 08:01:35,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default2]:[2022-03-03 08:01:35,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default4]:[2022-03-03 08:01:35,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default6]:[2022-03-03 08:01:35,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default6]:[2022-03-03 08:01:35,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default0]:[2022-03-03 08:01:35,195] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default3]:[2022-03-03 08:01:35,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default6]:[2022-03-03 08:01:35,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default6]:[2022-03-03 08:01:35,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default4]:[2022-03-03 08:01:35,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default2]:[2022-03-03 08:01:35,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default4]:[2022-03-03 08:01:35,271] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default4]:[2022-03-03 08:01:35,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default2]:[2022-03-03 08:01:35,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default5]:[2022-03-03 08:01:35,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default3]:[2022-03-03 08:01:35,355] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default5]:[2022-03-03 08:01:35,319] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default0]:[2022-03-03 08:01:35,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default0]:[2022-03-03 08:01:35,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default2]:[2022-03-03 08:01:35,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default0]:[2022-03-03 08:01:35,404] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default6]:[2022-03-03 08:01:35,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default4]:[2022-03-03 08:01:35,383] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default1]:[2022-03-03 08:01:35,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default1]:[2022-03-03 08:01:35,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default3]:[2022-03-03 08:01:35,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default2]:[2022-03-03 08:01:35,460] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default7]:[2022-03-03 08:01:35,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default0]:[2022-03-03 08:01:35,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default4]:[2022-03-03 08:01:35,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default2]:[2022-03-03 08:01:35,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default1]:[2022-03-03 08:01:35,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default5]:[2022-03-03 08:01:35,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default4]:[2022-03-03 08:01:35,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default2]:[2022-03-03 08:01:35,619] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default2]:[2022-03-03 08:01:35,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default4]:[2022-03-03 08:01:35,682] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default7]:[2022-03-03 08:01:35,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default1]:[2022-03-03 08:01:35,680] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default3]:[2022-03-03 08:01:35,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default2]:[2022-03-03 08:01:35,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default7]:[2022-03-03 08:01:35,740] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default7]:[2022-03-03 08:01:35,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default6]:[2022-03-03 08:01:35,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default0]:[2022-03-03 08:01:35,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default7]:[2022-03-03 08:01:35,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default7]:[2022-03-03 08:01:35,760] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default5]:[2022-03-03 08:01:35,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default5]:[2022-03-03 08:01:35,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default4]:[2022-03-03 08:01:35,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default6]:[2022-03-03 08:01:35,910] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default5]:[2022-03-03 08:01:35,898] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default5]:[2022-03-03 08:01:35,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default2]:[2022-03-03 08:01:35,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default7]:[2022-03-03 08:01:35,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default2]:[2022-03-03 08:01:35,903] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default3]:[2022-03-03 08:01:35,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default4]:[2022-03-03 08:01:35,976] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default3]:[2022-03-03 08:01:35,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default7]:[2022-03-03 08:01:35,928] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default7]:[2022-03-03 08:01:35,945] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default7]:[2022-03-03 08:01:35,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default6]:[2022-03-03 08:01:35,924] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default5]:[2022-03-03 08:01:36,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default6]:[2022-03-03 08:01:36,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default4]:[2022-03-03 08:01:36,159] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default7]:[2022-03-03 08:01:36,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default1]:[2022-03-03 08:01:36,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default4]:[2022-03-03 08:01:36,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default7]:[2022-03-03 08:01:36,340] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default4]:[2022-03-03 08:01:36,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default6]:[2022-03-03 08:01:36,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default4]:[2022-03-03 08:01:36,448] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default3]:[2022-03-03 08:01:36,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default5]:[2022-03-03 08:01:36,523] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default6]:[2022-03-03 08:01:36,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default7]:[2022-03-03 08:01:36,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default3]:[2022-03-03 08:01:36,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default3]:[2022-03-03 08:01:36,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default2]:[2022-03-03 08:01:36,618] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default3]:[2022-03-03 08:01:36,628] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default3]:[2022-03-03 08:01:36,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default6]:[2022-03-03 08:01:36,706] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default5]:[2022-03-03 08:01:36,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default1]:[2022-03-03 08:01:36,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default1]:[2022-03-03 08:01:36,742] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default3]:[2022-03-03 08:01:36,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default0]:[2022-03-03 08:01:36,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default0]:[2022-03-03 08:01:36,708] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default7]:[2022-03-03 08:01:36,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default0]:[2022-03-03 08:01:36,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default2]:[2022-03-03 08:01:36,777] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default2]:[2022-03-03 08:01:36,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default7]:[2022-03-03 08:01:36,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default4]:[2022-03-03 08:01:36,770] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default0]:[2022-03-03 08:01:36,806] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default2]:[2022-03-03 08:01:36,798] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default0]:[2022-03-03 08:01:36,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default4]:[2022-03-03 08:01:36,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default6]:[2022-03-03 08:01:36,916] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default0]:[2022-03-03 08:01:36,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default7]:[2022-03-03 08:01:36,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default7]:[2022-03-03 08:01:36,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default1]:[2022-03-03 08:01:36,976] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default6]:[2022-03-03 08:01:37,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default0]:[2022-03-03 08:01:36,971] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default1]:[2022-03-03 08:01:36,952] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default2]:[2022-03-03 08:01:36,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default2]:[2022-03-03 08:01:36,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default5]:[2022-03-03 08:01:37,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default1]:[2022-03-03 08:01:37,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default5]:[2022-03-03 08:01:37,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default7]:[2022-03-03 08:01:37,151] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default6]:[2022-03-03 08:01:37,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default1]:[2022-03-03 08:01:37,282] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default6]:[2022-03-03 08:01:37,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default2]:[2022-03-03 08:01:37,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default7]:[2022-03-03 08:01:37,394] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default0]:[2022-03-03 08:01:37,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default4]:[2022-03-03 08:01:37,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default2]:[2022-03-03 08:01:37,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default6]:[2022-03-03 08:01:37,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default5]:[2022-03-03 08:01:37,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default6]:[2022-03-03 08:01:37,796] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default3]:[2022-03-03 08:01:37,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default3]:[2022-03-03 08:01:37,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default4]:[2022-03-03 08:01:37,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default7]:[2022-03-03 08:01:37,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default6]:[2022-03-03 08:01:38,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default3]:[2022-03-03 08:01:38,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default1]:[2022-03-03 08:01:38,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default4]:[2022-03-03 08:01:38,177] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default5]:[2022-03-03 08:01:38,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default1]:[2022-03-03 08:01:38,274] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default6]:[2022-03-03 08:01:38,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default0]:[2022-03-03 08:01:38,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default1]:[2022-03-03 08:01:38,255] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default0]:[2022-03-03 08:01:38,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default7]:[2022-03-03 08:01:38,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default5]:[2022-03-03 08:01:38,290] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default7]:[2022-03-03 08:01:38,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default4]:[2022-03-03 08:01:38,362] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default6]:[2022-03-03 08:01:38,472] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default5]:[2022-03-03 08:01:38,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default0]:[2022-03-03 08:01:38,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default5]:[2022-03-03 08:01:38,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default5]:[2022-03-03 08:01:38,600] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default1]:[2022-03-03 08:01:38,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default4]:[2022-03-03 08:01:38,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default7]:[2022-03-03 08:01:38,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default4]:[2022-03-03 08:01:38,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default2]:[2022-03-03 08:01:39,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default3]:[2022-03-03 08:01:39,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default6]:[2022-03-03 08:01:40,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default7]:[2022-03-03 08:01:40,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default4]:[2022-03-03 08:01:40,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default5]:[2022-03-03 08:01:41,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default2]:[2022-03-03 08:01:41,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default3]:[2022-03-03 08:01:41,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default0]: successfully saved checkpoint at iteration 500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 08:01:42,080] [INFO] 
[engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default1]:[2022-03-03 08:01:42,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default7]:time (ms) | save-checkpoint: 67248.11 [default7]: iteration 501/ 128728 | consumed samples: 8016 | consumed tokens: 16416768 | elapsed time per iteration (s): 82.50 | learning rate: 2.627E-06 | global batch size: 16 | lm loss: 8.488214E+00 | grad norm: 1.406 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.194 | TFLOPs: 1.48 | [default7]: iteration 502/ 128728 | consumed samples: 8032 | consumed tokens: 16449536 | elapsed time per iteration (s): 15.23 | learning rate: 2.632E-06 | global batch size: 16 | lm loss: 8.423536E+00 | grad norm: 2.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 503/ 128728 | consumed samples: 8048 | consumed tokens: 16482304 | elapsed time per iteration (s): 15.22 | learning rate: 2.637E-06 | global batch size: 16 | lm loss: 8.185781E+00 | grad norm: 1.556 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 504/ 128728 | consumed samples: 8064 | consumed tokens: 16515072 | elapsed time per iteration (s): 15.24 | learning rate: 2.642E-06 | global batch size: 16 | lm loss: 8.098662E+00 | grad norm: 2.003 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 505/ 128728 | consumed samples: 8080 | consumed tokens: 16547840 | elapsed time per iteration (s): 15.26 | 
learning rate: 2.648E-06 | global batch size: 16 | lm loss: 8.311114E+00 | grad norm: 1.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 506/ 128728 | consumed samples: 8096 | consumed tokens: 16580608 | elapsed time per iteration (s): 15.26 | learning rate: 2.653E-06 | global batch size: 16 | lm loss: 8.314884E+00 | grad norm: 1.417 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 507/ 128728 | consumed samples: 8112 | consumed tokens: 16613376 | elapsed time per iteration (s): 15.21 | learning rate: 2.658E-06 | global batch size: 16 | lm loss: 8.274142E+00 | grad norm: 1.901 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 508/ 128728 | consumed samples: 8128 | consumed tokens: 16646144 | elapsed time per iteration (s): 15.22 | learning rate: 2.663E-06 | global batch size: 16 | lm loss: 8.300067E+00 | grad norm: 1.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 509/ 128728 | consumed samples: 8144 | consumed tokens: 16678912 | elapsed time per iteration (s): 15.24 | learning rate: 2.669E-06 | global batch size: 16 | lm loss: 8.125998E+00 | grad norm: 1.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 510/ 128728 | consumed samples: 8160 | consumed tokens: 16711680 | elapsed time per iteration (s): 15.21 | learning rate: 2.674E-06 | global batch size: 16 | lm loss: 8.157375E+00 | grad norm: 1.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 511/ 128728 | consumed 
samples: 8176 | consumed tokens: 16744448 | elapsed time per iteration (s): 15.26 | learning rate: 2.679E-06 | global batch size: 16 | lm loss: 8.114425E+00 | grad norm: 2.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 512/ 128728 | consumed samples: 8192 | consumed tokens: 16777216 | elapsed time per iteration (s): 15.24 | learning rate: 2.684E-06 | global batch size: 16 | lm loss: 8.181797E+00 | grad norm: 1.527 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 513/ 128728 | consumed samples: 8208 | consumed tokens: 16809984 | elapsed time per iteration (s): 15.19 | learning rate: 2.690E-06 | global batch size: 16 | lm loss: 8.276696E+00 | grad norm: 2.095 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 514/ 128728 | consumed samples: 8224 | consumed tokens: 16842752 | elapsed time per iteration (s): 15.18 | learning rate: 2.695E-06 | global batch size: 16 | lm loss: 8.265854E+00 | grad norm: 2.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 515/ 128728 | consumed samples: 8240 | consumed tokens: 16875520 | elapsed time per iteration (s): 15.18 | learning rate: 2.700E-06 | global batch size: 16 | lm loss: 8.100229E+00 | grad norm: 1.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 516/ 128728 | consumed samples: 8256 | consumed tokens: 16908288 | elapsed time per iteration (s): 15.21 | learning rate: 2.705E-06 | global batch size: 16 | lm loss: 8.021216E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 517/ 128728 | consumed samples: 8272 | consumed tokens: 16941056 | elapsed time per iteration (s): 15.20 | learning rate: 2.711E-06 | global batch size: 16 | lm loss: 8.086869E+00 | grad norm: 1.505 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 518/ 128728 | consumed samples: 8288 | consumed tokens: 16973824 | elapsed time per iteration (s): 15.26 | learning rate: 2.716E-06 | global batch size: 16 | lm loss: 8.120964E+00 | grad norm: 2.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 519/ 128728 | consumed samples: 8304 | consumed tokens: 17006592 | elapsed time per iteration (s): 15.25 | learning rate: 2.721E-06 | global batch size: 16 | lm loss: 8.232798E+00 | grad norm: 1.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 520/ 128728 | consumed samples: 8320 | consumed tokens: 17039360 | elapsed time per iteration (s): 15.23 | learning rate: 2.726E-06 | global batch size: 16 | lm loss: 8.287365E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 521/ 128728 | consumed samples: 8336 | consumed tokens: 17072128 | elapsed time per iteration (s): 15.23 | learning rate: 2.732E-06 | global batch size: 16 | lm loss: 8.058668E+00 | grad norm: 2.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 522/ 128728 | consumed samples: 8352 | consumed tokens: 17104896 | elapsed time per iteration (s): 15.19 | learning rate: 2.737E-06 | global batch size: 16 | lm loss: 7.900158E+00 | grad norm: 2.065 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 523/ 128728 | consumed samples: 8368 | consumed tokens: 17137664 | elapsed time per iteration (s): 15.24 | learning rate: 2.742E-06 | global batch size: 16 | lm loss: 8.412863E+00 | grad norm: 1.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 524/ 128728 | consumed samples: 8384 | consumed tokens: 17170432 | elapsed time per iteration (s): 15.23 | learning rate: 2.747E-06 | global batch size: 16 | lm loss: 8.102924E+00 | grad norm: 2.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 525/ 128728 | consumed samples: 8400 | consumed tokens: 17203200 | elapsed time per iteration (s): 15.24 | learning rate: 2.753E-06 | global batch size: 16 | lm loss: 7.951356E+00 | grad norm: 1.575 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 526/ 128728 | consumed samples: 8416 | consumed tokens: 17235968 | elapsed time per iteration (s): 15.27 | learning rate: 2.758E-06 | global batch size: 16 | lm loss: 8.285418E+00 | grad norm: 2.349 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 527/ 128728 | consumed samples: 8432 | consumed tokens: 17268736 | elapsed time per iteration (s): 15.21 | learning rate: 2.763E-06 | global batch size: 16 | lm loss: 8.269984E+00 | grad norm: 2.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 528/ 128728 | consumed samples: 8448 | consumed tokens: 17301504 | elapsed time per iteration (s): 15.26 | learning 
rate: 2.768E-06 | global batch size: 16 | lm loss: 8.237260E+00 | grad norm: 2.085 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 529/ 128728 | consumed samples: 8464 | consumed tokens: 17334272 | elapsed time per iteration (s): 15.25 | learning rate: 2.773E-06 | global batch size: 16 | lm loss: 8.148373E+00 | grad norm: 2.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 530/ 128728 | consumed samples: 8480 | consumed tokens: 17367040 | elapsed time per iteration (s): 15.20 | learning rate: 2.779E-06 | global batch size: 16 | lm loss: 8.244123E+00 | grad norm: 1.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 531/ 128728 | consumed samples: 8496 | consumed tokens: 17399808 | elapsed time per iteration (s): 15.16 | learning rate: 2.784E-06 | global batch size: 16 | lm loss: 8.061798E+00 | grad norm: 1.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 532/ 128728 | consumed samples: 8512 | consumed tokens: 17432576 | elapsed time per iteration (s): 15.19 | learning rate: 2.789E-06 | global batch size: 16 | lm loss: 8.042222E+00 | grad norm: 1.489 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 533/ 128728 | consumed samples: 8528 | consumed tokens: 17465344 | elapsed time per iteration (s): 15.18 | learning rate: 2.794E-06 | global batch size: 16 | lm loss: 8.086902E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 534/ 128728 | consumed samples: 
8544 | consumed tokens: 17498112 | elapsed time per iteration (s): 15.23 | learning rate: 2.800E-06 | global batch size: 16 | lm loss: 8.083276E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 535/ 128728 | consumed samples: 8560 | consumed tokens: 17530880 | elapsed time per iteration (s): 15.18 | learning rate: 2.805E-06 | global batch size: 16 | lm loss: 8.244881E+00 | grad norm: 1.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 536/ 128728 | consumed samples: 8576 | consumed tokens: 17563648 | elapsed time per iteration (s): 15.18 | learning rate: 2.810E-06 | global batch size: 16 | lm loss: 8.199797E+00 | grad norm: 1.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 537/ 128728 | consumed samples: 8592 | consumed tokens: 17596416 | elapsed time per iteration (s): 15.27 | learning rate: 2.815E-06 | global batch size: 16 | lm loss: 8.002762E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 538/ 128728 | consumed samples: 8608 | consumed tokens: 17629184 | elapsed time per iteration (s): 15.28 | learning rate: 2.821E-06 | global batch size: 16 | lm loss: 8.290606E+00 | grad norm: 4.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 539/ 128728 | consumed samples: 8624 | consumed tokens: 17661952 | elapsed time per iteration (s): 15.26 | learning rate: 2.826E-06 | global batch size: 16 | lm loss: 7.995849E+00 | grad norm: 1.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 540/ 128728 | consumed samples: 8640 | consumed tokens: 17694720 | elapsed time per iteration (s): 15.26 | learning rate: 2.831E-06 | global batch size: 16 | lm loss: 8.186256E+00 | grad norm: 1.338 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 541/ 128728 | consumed samples: 8656 | consumed tokens: 17727488 | elapsed time per iteration (s): 15.18 | learning rate: 2.836E-06 | global batch size: 16 | lm loss: 8.296293E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 542/ 128728 | consumed samples: 8672 | consumed tokens: 17760256 | elapsed time per iteration (s): 15.25 | learning rate: 2.842E-06 | global batch size: 16 | lm loss: 8.072968E+00 | grad norm: 2.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 543/ 128728 | consumed samples: 8688 | consumed tokens: 17793024 | elapsed time per iteration (s): 15.20 | learning rate: 2.847E-06 | global batch size: 16 | lm loss: 8.082905E+00 | grad norm: 2.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 544/ 128728 | consumed samples: 8704 | consumed tokens: 17825792 | elapsed time per iteration (s): 15.23 | learning rate: 2.852E-06 | global batch size: 16 | lm loss: 8.032642E+00 | grad norm: 2.560 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 545/ 128728 | consumed samples: 8720 | consumed tokens: 17858560 | elapsed time per iteration (s): 15.22 | learning rate: 2.857E-06 | global batch size: 16 | lm loss: 8.391273E+00 | grad norm: 2.395 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 546/ 128728 | consumed samples: 8736 | consumed tokens: 17891328 | elapsed time per iteration (s): 15.22 | learning rate: 2.863E-06 | global batch size: 16 | lm loss: 8.539359E+00 | grad norm: 4.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 547/ 128728 | consumed samples: 8752 | consumed tokens: 17924096 | elapsed time per iteration (s): 15.23 | learning rate: 2.868E-06 | global batch size: 16 | lm loss: 8.038402E+00 | grad norm: 3.554 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 548/ 128728 | consumed samples: 8768 | consumed tokens: 17956864 | elapsed time per iteration (s): 15.24 | learning rate: 2.873E-06 | global batch size: 16 | lm loss: 8.210316E+00 | grad norm: 1.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 549/ 128728 | consumed samples: 8784 | consumed tokens: 17989632 | elapsed time per iteration (s): 15.27 | learning rate: 2.878E-06 | global batch size: 16 | lm loss: 8.174568E+00 | grad norm: 1.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 550/ 128728 | consumed samples: 8800 | consumed tokens: 18022400 | elapsed time per iteration (s): 15.25 | learning rate: 2.884E-06 | global batch size: 16 | lm loss: 8.142506E+00 | grad norm: 2.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 551/ 128728 | consumed samples: 8816 | consumed tokens: 18055168 | elapsed time per iteration (s): 15.23 | learning rate: 
2.889E-06 | global batch size: 16 | lm loss: 8.139000E+00 | grad norm: 1.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 552/ 128728 | consumed samples: 8832 | consumed tokens: 18087936 | elapsed time per iteration (s): 15.22 | learning rate: 2.894E-06 | global batch size: 16 | lm loss: 8.084474E+00 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 553/ 128728 | consumed samples: 8848 | consumed tokens: 18120704 | elapsed time per iteration (s): 15.27 | learning rate: 2.899E-06 | global batch size: 16 | lm loss: 8.098001E+00 | grad norm: 1.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 554/ 128728 | consumed samples: 8864 | consumed tokens: 18153472 | elapsed time per iteration (s): 15.23 | learning rate: 2.905E-06 | global batch size: 16 | lm loss: 8.071024E+00 | grad norm: 1.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 555/ 128728 | consumed samples: 8880 | consumed tokens: 18186240 | elapsed time per iteration (s): 15.23 | learning rate: 2.910E-06 | global batch size: 16 | lm loss: 8.011195E+00 | grad norm: 1.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 556/ 128728 | consumed samples: 8896 | consumed tokens: 18219008 | elapsed time per iteration (s): 15.24 | learning rate: 2.915E-06 | global batch size: 16 | lm loss: 8.171795E+00 | grad norm: 1.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 557/ 128728 | consumed samples: 8912 | 
consumed tokens: 18251776 | elapsed time per iteration (s): 15.24 | learning rate: 2.920E-06 | global batch size: 16 | lm loss: 8.022076E+00 | grad norm: 1.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 558/ 128728 | consumed samples: 8928 | consumed tokens: 18284544 | elapsed time per iteration (s): 15.22 | learning rate: 2.926E-06 | global batch size: 16 | lm loss: 7.988214E+00 | grad norm: 1.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 559/ 128728 | consumed samples: 8944 | consumed tokens: 18317312 | elapsed time per iteration (s): 15.23 | learning rate: 2.931E-06 | global batch size: 16 | lm loss: 7.990775E+00 | grad norm: 1.640 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 560/ 128728 | consumed samples: 8960 | consumed tokens: 18350080 | elapsed time per iteration (s): 15.24 | learning rate: 2.936E-06 | global batch size: 16 | lm loss: 8.082418E+00 | grad norm: 1.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 561/ 128728 | consumed samples: 8976 | consumed tokens: 18382848 | elapsed time per iteration (s): 15.24 | learning rate: 2.941E-06 | global batch size: 16 | lm loss: 8.083212E+00 | grad norm: 2.008 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 562/ 128728 | consumed samples: 8992 | consumed tokens: 18415616 | elapsed time per iteration (s): 15.25 | learning rate: 2.947E-06 | global batch size: 16 | lm loss: 7.988510E+00 | grad norm: 1.481 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.049 | TFLOPs: 8.03 | [default7]: iteration 563/ 128728 | consumed samples: 9008 | consumed tokens: 18448384 | elapsed time per iteration (s): 15.26 | learning rate: 2.952E-06 | global batch size: 16 | lm loss: 8.018039E+00 | grad norm: 2.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 564/ 128728 | consumed samples: 9024 | consumed tokens: 18481152 | elapsed time per iteration (s): 15.22 | learning rate: 2.957E-06 | global batch size: 16 | lm loss: 8.159368E+00 | grad norm: 1.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 565/ 128728 | consumed samples: 9040 | consumed tokens: 18513920 | elapsed time per iteration (s): 15.24 | learning rate: 2.962E-06 | global batch size: 16 | lm loss: 8.076411E+00 | grad norm: 2.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 566/ 128728 | consumed samples: 9056 | consumed tokens: 18546688 | elapsed time per iteration (s): 15.24 | learning rate: 2.967E-06 | global batch size: 16 | lm loss: 8.065808E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 567/ 128728 | consumed samples: 9072 | consumed tokens: 18579456 | elapsed time per iteration (s): 15.24 | learning rate: 2.973E-06 | global batch size: 16 | lm loss: 8.268667E+00 | grad norm: 2.372 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 568/ 128728 | consumed samples: 9088 | consumed tokens: 18612224 | elapsed time per iteration (s): 15.21 | learning rate: 2.978E-06 | global batch size: 16 | lm loss: 8.158611E+00 | grad norm: 2.448 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 569/ 128728 | consumed samples: 9104 | consumed tokens: 18644992 | elapsed time per iteration (s): 15.19 | learning rate: 2.983E-06 | global batch size: 16 | lm loss: 8.178822E+00 | grad norm: 1.379 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 570/ 128728 | consumed samples: 9120 | consumed tokens: 18677760 | elapsed time per iteration (s): 15.23 | learning rate: 2.988E-06 | global batch size: 16 | lm loss: 8.175869E+00 | grad norm: 1.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 571/ 128728 | consumed samples: 9136 | consumed tokens: 18710528 | elapsed time per iteration (s): 15.24 | learning rate: 2.994E-06 | global batch size: 16 | lm loss: 8.093798E+00 | grad norm: 1.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 572/ 128728 | consumed samples: 9152 | consumed tokens: 18743296 | elapsed time per iteration (s): 15.22 | learning rate: 2.999E-06 | global batch size: 16 | lm loss: 8.181890E+00 | grad norm: 2.048 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 573/ 128728 | consumed samples: 9168 | consumed tokens: 18776064 | elapsed time per iteration (s): 15.25 | learning rate: 3.004E-06 | global batch size: 16 | lm loss: 8.041045E+00 | grad norm: 2.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 574/ 128728 | consumed samples: 9184 | consumed tokens: 18808832 | elapsed time per iteration (s): 15.23 | learning rate: 3.009E-06 | 
global batch size: 16 | lm loss: 8.138422E+00 | grad norm: 3.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 575/ 128728 | consumed samples: 9200 | consumed tokens: 18841600 | elapsed time per iteration (s): 15.22 | learning rate: 3.015E-06 | global batch size: 16 | lm loss: 8.045207E+00 | grad norm: 2.058 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 576/ 128728 | consumed samples: 9216 | consumed tokens: 18874368 | elapsed time per iteration (s): 15.22 | learning rate: 3.020E-06 | global batch size: 16 | lm loss: 7.972528E+00 | grad norm: 1.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 577/ 128728 | consumed samples: 9232 | consumed tokens: 18907136 | elapsed time per iteration (s): 15.25 | learning rate: 3.025E-06 | global batch size: 16 | lm loss: 8.178508E+00 | grad norm: 1.988 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 578/ 128728 | consumed samples: 9248 | consumed tokens: 18939904 | elapsed time per iteration (s): 15.24 | learning rate: 3.030E-06 | global batch size: 16 | lm loss: 7.980485E+00 | grad norm: 1.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 579/ 128728 | consumed samples: 9264 | consumed tokens: 18972672 | elapsed time per iteration (s): 15.24 | learning rate: 3.036E-06 | global batch size: 16 | lm loss: 7.864195E+00 | grad norm: 1.538 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 580/ 128728 | consumed samples: 9280 | consumed 
tokens: 19005440 | elapsed time per iteration (s): 15.25 | learning rate: 3.041E-06 | global batch size: 16 | lm loss: 8.087688E+00 | grad norm: 2.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 581/ 128728 | consumed samples: 9296 | consumed tokens: 19038208 | elapsed time per iteration (s): 15.22 | learning rate: 3.046E-06 | global batch size: 16 | lm loss: 8.038260E+00 | grad norm: 1.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 582/ 128728 | consumed samples: 9312 | consumed tokens: 19070976 | elapsed time per iteration (s): 15.25 | learning rate: 3.051E-06 | global batch size: 16 | lm loss: 7.954132E+00 | grad norm: 1.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 583/ 128728 | consumed samples: 9328 | consumed tokens: 19103744 | elapsed time per iteration (s): 15.18 | learning rate: 3.057E-06 | global batch size: 16 | lm loss: 8.152493E+00 | grad norm: 2.484 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 584/ 128728 | consumed samples: 9344 | consumed tokens: 19136512 | elapsed time per iteration (s): 15.19 | learning rate: 3.062E-06 | global batch size: 16 | lm loss: 8.236040E+00 | grad norm: 1.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 585/ 128728 | consumed samples: 9360 | consumed tokens: 19169280 | elapsed time per iteration (s): 15.17 | learning rate: 3.067E-06 | global batch size: 16 | lm loss: 7.907086E+00 | grad norm: 2.350 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | 
TFLOPs: 8.07 | [default7]: iteration 586/ 128728 | consumed samples: 9376 | consumed tokens: 19202048 | elapsed time per iteration (s): 15.19 | learning rate: 3.072E-06 | global batch size: 16 | lm loss: 8.304672E+00 | grad norm: 1.986 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 587/ 128728 | consumed samples: 9392 | consumed tokens: 19234816 | elapsed time per iteration (s): 15.24 | learning rate: 3.078E-06 | global batch size: 16 | lm loss: 8.053318E+00 | grad norm: 1.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 588/ 128728 | consumed samples: 9408 | consumed tokens: 19267584 | elapsed time per iteration (s): 15.24 | learning rate: 3.083E-06 | global batch size: 16 | lm loss: 8.005896E+00 | grad norm: 1.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 589/ 128728 | consumed samples: 9424 | consumed tokens: 19300352 | elapsed time per iteration (s): 15.26 | learning rate: 3.088E-06 | global batch size: 16 | lm loss: 7.824888E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 590/ 128728 | consumed samples: 9440 | consumed tokens: 19333120 | elapsed time per iteration (s): 15.23 | learning rate: 3.093E-06 | global batch size: 16 | lm loss: 8.009818E+00 | grad norm: 1.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 591/ 128728 | consumed samples: 9456 | consumed tokens: 19365888 | elapsed time per iteration (s): 15.22 | learning rate: 3.099E-06 | global batch size: 16 | lm loss: 7.998293E+00 | grad norm: 1.696 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 592/ 128728 | consumed samples: 9472 | consumed tokens: 19398656 | elapsed time per iteration (s): 15.27 | learning rate: 3.104E-06 | global batch size: 16 | lm loss: 8.065947E+00 | grad norm: 1.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 593/ 128728 | consumed samples: 9488 | consumed tokens: 19431424 | elapsed time per iteration (s): 15.27 | learning rate: 3.109E-06 | global batch size: 16 | lm loss: 7.924274E+00 | grad norm: 1.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 594/ 128728 | consumed samples: 9504 | consumed tokens: 19464192 | elapsed time per iteration (s): 15.22 | learning rate: 3.114E-06 | global batch size: 16 | lm loss: 7.962350E+00 | grad norm: 1.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 595/ 128728 | consumed samples: 9520 | consumed tokens: 19496960 | elapsed time per iteration (s): 15.23 | learning rate: 3.120E-06 | global batch size: 16 | lm loss: 8.032861E+00 | grad norm: 3.286 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 596/ 128728 | consumed samples: 9536 | consumed tokens: 19529728 | elapsed time per iteration (s): 15.27 | learning rate: 3.125E-06 | global batch size: 16 | lm loss: 8.025990E+00 | grad norm: 3.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 597/ 128728 | consumed samples: 9552 | consumed tokens: 19562496 | elapsed time per iteration (s): 15.21 | learning rate: 3.130E-06 | global 
batch size: 16 | lm loss: 8.163157E+00 | grad norm: 2.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 598/ 128728 | consumed samples: 9568 | consumed tokens: 19595264 | elapsed time per iteration (s): 15.17 | learning rate: 3.135E-06 | global batch size: 16 | lm loss: 8.050694E+00 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 599/ 128728 | consumed samples: 9584 | consumed tokens: 19628032 | elapsed time per iteration (s): 15.20 | learning rate: 3.140E-06 | global batch size: 16 | lm loss: 8.062954E+00 | grad norm: 1.669 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 600/ 128728 | consumed samples: 9600 | consumed tokens: 19660800 | elapsed time per iteration (s): 15.21 | learning rate: 3.146E-06 | global batch size: 16 | lm loss: 8.087465E+00 | grad norm: 1.461 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 601/ 128728 | consumed samples: 9616 | consumed tokens: 19693568 | elapsed time per iteration (s): 15.18 | learning rate: 3.151E-06 | global batch size: 16 | lm loss: 8.023573E+00 | grad norm: 1.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 602/ 128728 | consumed samples: 9632 | consumed tokens: 19726336 | elapsed time per iteration (s): 15.22 | learning rate: 3.156E-06 | global batch size: 16 | lm loss: 7.818781E+00 | grad norm: 1.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 603/ 128728 | consumed samples: 9648 | consumed tokens: 
19759104 | elapsed time per iteration (s): 15.22 | learning rate: 3.161E-06 | global batch size: 16 | lm loss: 8.015720E+00 | grad norm: 1.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 604/ 128728 | consumed samples: 9664 | consumed tokens: 19791872 | elapsed time per iteration (s): 15.25 | learning rate: 3.167E-06 | global batch size: 16 | lm loss: 8.103092E+00 | grad norm: 2.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 605/ 128728 | consumed samples: 9680 | consumed tokens: 19824640 | elapsed time per iteration (s): 15.24 | learning rate: 3.172E-06 | global batch size: 16 | lm loss: 7.763780E+00 | grad norm: 2.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 606/ 128728 | consumed samples: 9696 | consumed tokens: 19857408 | elapsed time per iteration (s): 15.24 | learning rate: 3.177E-06 | global batch size: 16 | lm loss: 7.950778E+00 | grad norm: 1.569 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 607/ 128728 | consumed samples: 9712 | consumed tokens: 19890176 | elapsed time per iteration (s): 15.26 | learning rate: 3.182E-06 | global batch size: 16 | lm loss: 7.931053E+00 | grad norm: 2.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 608/ 128728 | consumed samples: 9728 | consumed tokens: 19922944 | elapsed time per iteration (s): 15.23 | learning rate: 3.188E-06 | global batch size: 16 | lm loss: 8.145065E+00 | grad norm: 1.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 609/ 128728 | consumed samples: 9744 | consumed tokens: 19955712 | elapsed time per iteration (s): 15.26 | learning rate: 3.193E-06 | global batch size: 16 | lm loss: 7.848729E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 610/ 128728 | consumed samples: 9760 | consumed tokens: 19988480 | elapsed time per iteration (s): 15.25 | learning rate: 3.198E-06 | global batch size: 16 | lm loss: 8.195064E+00 | grad norm: 2.007 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 611/ 128728 | consumed samples: 9776 | consumed tokens: 20021248 | elapsed time per iteration (s): 15.17 | learning rate: 3.203E-06 | global batch size: 16 | lm loss: 8.242138E+00 | grad norm: 1.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 612/ 128728 | consumed samples: 9792 | consumed tokens: 20054016 | elapsed time per iteration (s): 15.23 | learning rate: 3.209E-06 | global batch size: 16 | lm loss: 8.117524E+00 | grad norm: 1.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 613/ 128728 | consumed samples: 9808 | consumed tokens: 20086784 | elapsed time per iteration (s): 15.25 | learning rate: 3.214E-06 | global batch size: 16 | lm loss: 7.918928E+00 | grad norm: 1.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 614/ 128728 | consumed samples: 9824 | consumed tokens: 20119552 | elapsed time per iteration (s): 15.26 | learning rate: 3.219E-06 | global batch size: 16 | lm loss: 8.051600E+00 | grad norm: 2.007 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 615/ 128728 | consumed samples: 9840 | consumed tokens: 20152320 | elapsed time per iteration (s): 15.20 | learning rate: 3.224E-06 | global batch size: 16 | lm loss: 7.935547E+00 | grad norm: 1.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 616/ 128728 | consumed samples: 9856 | consumed tokens: 20185088 | elapsed time per iteration (s): 15.24 | learning rate: 3.230E-06 | global batch size: 16 | lm loss: 8.028803E+00 | grad norm: 1.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 617/ 128728 | consumed samples: 9872 | consumed tokens: 20217856 | elapsed time per iteration (s): 15.22 | learning rate: 3.235E-06 | global batch size: 16 | lm loss: 7.955178E+00 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 618/ 128728 | consumed samples: 9888 | consumed tokens: 20250624 | elapsed time per iteration (s): 15.24 | learning rate: 3.240E-06 | global batch size: 16 | lm loss: 7.941856E+00 | grad norm: 2.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 619/ 128728 | consumed samples: 9904 | consumed tokens: 20283392 | elapsed time per iteration (s): 15.21 | learning rate: 3.245E-06 | global batch size: 16 | lm loss: 8.104746E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 620/ 128728 | consumed samples: 9920 | consumed tokens: 20316160 | elapsed time per iteration (s): 15.18 | learning rate: 3.251E-06 | global batch size: 
16 | lm loss: 7.847284E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 621/ 128728 | consumed samples: 9936 | consumed tokens: 20348928 | elapsed time per iteration (s): 15.23 | learning rate: 3.256E-06 | global batch size: 16 | lm loss: 8.035351E+00 | grad norm: 1.532 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 622/ 128728 | consumed samples: 9952 | consumed tokens: 20381696 | elapsed time per iteration (s): 15.29 | learning rate: 3.261E-06 | global batch size: 16 | lm loss: 7.982013E+00 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 623/ 128728 | consumed samples: 9968 | consumed tokens: 20414464 | elapsed time per iteration (s): 15.17 | learning rate: 3.266E-06 | global batch size: 16 | lm loss: 7.936229E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 624/ 128728 | consumed samples: 9984 | consumed tokens: 20447232 | elapsed time per iteration (s): 15.24 | learning rate: 3.272E-06 | global batch size: 16 | lm loss: 7.954924E+00 | grad norm: 1.624 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 625/ 128728 | consumed samples: 10000 | consumed tokens: 20480000 | elapsed time per iteration (s): 15.25 | learning rate: 3.277E-06 | global batch size: 16 | lm loss: 7.793859E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 626/ 128728 | consumed samples: 10016 | consumed tokens: 20512768 | 
elapsed time per iteration (s): 15.26 | learning rate: 3.282E-06 | global batch size: 16 | lm loss: 8.114607E+00 | grad norm: 1.913 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 627/ 128728 | consumed samples: 10032 | consumed tokens: 20545536 | elapsed time per iteration (s): 15.24 | learning rate: 3.287E-06 | global batch size: 16 | lm loss: 7.914503E+00 | grad norm: 1.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 628/ 128728 | consumed samples: 10048 | consumed tokens: 20578304 | elapsed time per iteration (s): 15.27 | learning rate: 3.293E-06 | global batch size: 16 | lm loss: 8.035368E+00 | grad norm: 1.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 629/ 128728 | consumed samples: 10064 | consumed tokens: 20611072 | elapsed time per iteration (s): 15.17 | learning rate: 3.298E-06 | global batch size: 16 | lm loss: 7.947924E+00 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 630/ 128728 | consumed samples: 10080 | consumed tokens: 20643840 | elapsed time per iteration (s): 15.23 | learning rate: 3.303E-06 | global batch size: 16 | lm loss: 7.966818E+00 | grad norm: 1.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 631/ 128728 | consumed samples: 10096 | consumed tokens: 20676608 | elapsed time per iteration (s): 15.24 | learning rate: 3.308E-06 | global batch size: 16 | lm loss: 7.870564E+00 | grad norm: 1.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 632/ 128728 | consumed samples: 10112 | consumed tokens: 20709376 | elapsed time per iteration (s): 15.22 | learning rate: 3.314E-06 | global batch size: 16 | lm loss: 7.987050E+00 | grad norm: 1.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 633/ 128728 | consumed samples: 10128 | consumed tokens: 20742144 | elapsed time per iteration (s): 15.18 | learning rate: 3.319E-06 | global batch size: 16 | lm loss: 7.923104E+00 | grad norm: 1.387 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 634/ 128728 | consumed samples: 10144 | consumed tokens: 20774912 | elapsed time per iteration (s): 15.22 | learning rate: 3.324E-06 | global batch size: 16 | lm loss: 7.981370E+00 | grad norm: 3.554 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 635/ 128728 | consumed samples: 10160 | consumed tokens: 20807680 | elapsed time per iteration (s): 15.22 | learning rate: 3.329E-06 | global batch size: 16 | lm loss: 7.451349E+00 | grad norm: 2.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 636/ 128728 | consumed samples: 10176 | consumed tokens: 20840448 | elapsed time per iteration (s): 15.28 | learning rate: 3.334E-06 | global batch size: 16 | lm loss: 7.894579E+00 | grad norm: 1.334 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 637/ 128728 | consumed samples: 10192 | consumed tokens: 20873216 | elapsed time per iteration (s): 15.24 | learning rate: 3.340E-06 | global batch size: 16 | lm loss: 7.836030E+00 | grad norm: 1.562 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 638/ 128728 | consumed samples: 10208 | consumed tokens: 20905984 | elapsed time per iteration (s): 15.17 | learning rate: 3.345E-06 | global batch size: 16 | lm loss: 7.979243E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 639/ 128728 | consumed samples: 10224 | consumed tokens: 20938752 | elapsed time per iteration (s): 15.21 | learning rate: 3.350E-06 | global batch size: 16 | lm loss: 7.720065E+00 | grad norm: 1.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 640/ 128728 | consumed samples: 10240 | consumed tokens: 20971520 | elapsed time per iteration (s): 15.24 | learning rate: 3.355E-06 | global batch size: 16 | lm loss: 7.699399E+00 | grad norm: 1.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 641/ 128728 | consumed samples: 10256 | consumed tokens: 21004288 | elapsed time per iteration (s): 15.22 | learning rate: 3.361E-06 | global batch size: 16 | lm loss: 8.025188E+00 | grad norm: 2.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 642/ 128728 | consumed samples: 10272 | consumed tokens: 21037056 | elapsed time per iteration (s): 15.26 | learning rate: 3.366E-06 | global batch size: 16 | lm loss: 7.736159E+00 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 643/ 128728 | consumed samples: 10288 | consumed tokens: 21069824 | elapsed time per iteration (s): 15.22 | learning rate: 3.371E-06 | global batch 
size: 16 | lm loss: 7.719475E+00 | grad norm: 2.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 644/ 128728 | consumed samples: 10304 | consumed tokens: 21102592 | elapsed time per iteration (s): 15.25 | learning rate: 3.376E-06 | global batch size: 16 | lm loss: 7.865746E+00 | grad norm: 1.433 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 645/ 128728 | consumed samples: 10320 | consumed tokens: 21135360 | elapsed time per iteration (s): 15.24 | learning rate: 3.382E-06 | global batch size: 16 | lm loss: 8.016085E+00 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 646/ 128728 | consumed samples: 10336 | consumed tokens: 21168128 | elapsed time per iteration (s): 15.21 | learning rate: 3.387E-06 | global batch size: 16 | lm loss: 7.879150E+00 | grad norm: 1.573 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 647/ 128728 | consumed samples: 10352 | consumed tokens: 21200896 | elapsed time per iteration (s): 15.27 | learning rate: 3.392E-06 | global batch size: 16 | lm loss: 7.871262E+00 | grad norm: 2.059 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 648/ 128728 | consumed samples: 10368 | consumed tokens: 21233664 | elapsed time per iteration (s): 15.24 | learning rate: 3.397E-06 | global batch size: 16 | lm loss: 8.009554E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 649/ 128728 | consumed samples: 10384 | consumed tokens: 
21266432 | elapsed time per iteration (s): 15.29 | learning rate: 3.403E-06 | global batch size: 16 | lm loss: 7.901595E+00 | grad norm: 1.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 650/ 128728 | consumed samples: 10400 | consumed tokens: 21299200 | elapsed time per iteration (s): 15.24 | learning rate: 3.408E-06 | global batch size: 16 | lm loss: 7.781230E+00 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 651/ 128728 | consumed samples: 10416 | consumed tokens: 21331968 | elapsed time per iteration (s): 15.24 | learning rate: 3.413E-06 | global batch size: 16 | lm loss: 7.900571E+00 | grad norm: 1.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 652/ 128728 | consumed samples: 10432 | consumed tokens: 21364736 | elapsed time per iteration (s): 15.27 | learning rate: 3.418E-06 | global batch size: 16 | lm loss: 8.001532E+00 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 653/ 128728 | consumed samples: 10448 | consumed tokens: 21397504 | elapsed time per iteration (s): 15.22 | learning rate: 3.424E-06 | global batch size: 16 | lm loss: 7.724453E+00 | grad norm: 1.467 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 654/ 128728 | consumed samples: 10464 | consumed tokens: 21430272 | elapsed time per iteration (s): 15.22 | learning rate: 3.429E-06 | global batch size: 16 | lm loss: 7.786034E+00 | grad norm: 2.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | 
TFLOPs: 8.05 | [default7]: iteration 655/ 128728 | consumed samples: 10480 | consumed tokens: 21463040 | elapsed time per iteration (s): 15.21 | learning rate: 3.434E-06 | global batch size: 16 | lm loss: 8.125753E+00 | grad norm: 2.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 656/ 128728 | consumed samples: 10496 | consumed tokens: 21495808 | elapsed time per iteration (s): 15.18 | learning rate: 3.439E-06 | global batch size: 16 | lm loss: 7.898974E+00 | grad norm: 1.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 657/ 128728 | consumed samples: 10512 | consumed tokens: 21528576 | elapsed time per iteration (s): 15.25 | learning rate: 3.445E-06 | global batch size: 16 | lm loss: 7.543049E+00 | grad norm: 2.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 658/ 128728 | consumed samples: 10528 | consumed tokens: 21561344 | elapsed time per iteration (s): 15.24 | learning rate: 3.450E-06 | global batch size: 16 | lm loss: 7.893567E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 659/ 128728 | consumed samples: 10544 | consumed tokens: 21594112 | elapsed time per iteration (s): 15.24 | learning rate: 3.455E-06 | global batch size: 16 | lm loss: 7.778220E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 660/ 128728 | consumed samples: 10560 | consumed tokens: 21626880 | elapsed time per iteration (s): 15.24 | learning rate: 3.460E-06 | global batch size: 16 | lm loss: 7.709709E+00 | grad norm: 1.215 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 661/ 128728 | consumed samples: 10576 | consumed tokens: 21659648 | elapsed time per iteration (s): 15.26 | learning rate: 3.466E-06 | global batch size: 16 | lm loss: 7.854992E+00 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 662/ 128728 | consumed samples: 10592 | consumed tokens: 21692416 | elapsed time per iteration (s): 15.23 | learning rate: 3.471E-06 | global batch size: 16 | lm loss: 7.712576E+00 | grad norm: 1.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 663/ 128728 | consumed samples: 10608 | consumed tokens: 21725184 | elapsed time per iteration (s): 15.24 | learning rate: 3.476E-06 | global batch size: 16 | lm loss: 7.989018E+00 | grad norm: 1.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 664/ 128728 | consumed samples: 10624 | consumed tokens: 21757952 | elapsed time per iteration (s): 15.21 | learning rate: 3.481E-06 | global batch size: 16 | lm loss: 7.714592E+00 | grad norm: 2.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 665/ 128728 | consumed samples: 10640 | consumed tokens: 21790720 | elapsed time per iteration (s): 15.24 | learning rate: 3.487E-06 | global batch size: 16 | lm loss: 7.794542E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 666/ 128728 | consumed samples: 10656 | consumed tokens: 21823488 | elapsed time per iteration (s): 15.26 | learning rate: 
3.492E-06 | global batch size: 16 | lm loss: 7.864296E+00 | grad norm: 1.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 667/ 128728 | consumed samples: 10672 | consumed tokens: 21856256 | elapsed time per iteration (s): 15.27 | learning rate: 3.497E-06 | global batch size: 16 | lm loss: 7.764245E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 668/ 128728 | consumed samples: 10688 | consumed tokens: 21889024 | elapsed time per iteration (s): 15.25 | learning rate: 3.502E-06 | global batch size: 16 | lm loss: 7.735165E+00 | grad norm: 1.385 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 669/ 128728 | consumed samples: 10704 | consumed tokens: 21921792 | elapsed time per iteration (s): 15.26 | learning rate: 3.507E-06 | global batch size: 16 | lm loss: 8.010805E+00 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 670/ 128728 | consumed samples: 10720 | consumed tokens: 21954560 | elapsed time per iteration (s): 15.21 | learning rate: 3.513E-06 | global batch size: 16 | lm loss: 8.047853E+00 | grad norm: 1.605 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 671/ 128728 | consumed samples: 10736 | consumed tokens: 21987328 | elapsed time per iteration (s): 15.17 | learning rate: 3.518E-06 | global batch size: 16 | lm loss: 7.816953E+00 | grad norm: 1.629 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 672/ 128728 | consumed samples: 
10752 | consumed tokens: 22020096 | elapsed time per iteration (s): 15.23 | learning rate: 3.523E-06 | global batch size: 16 | lm loss: 7.711550E+00 | grad norm: 1.454 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 673/ 128728 | consumed samples: 10768 | consumed tokens: 22052864 | elapsed time per iteration (s): 15.25 | learning rate: 3.528E-06 | global batch size: 16 | lm loss: 7.875598E+00 | grad norm: 1.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 674/ 128728 | consumed samples: 10784 | consumed tokens: 22085632 | elapsed time per iteration (s): 15.26 | learning rate: 3.534E-06 | global batch size: 16 | lm loss: 8.126424E+00 | grad norm: 1.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 675/ 128728 | consumed samples: 10800 | consumed tokens: 22118400 | elapsed time per iteration (s): 15.22 | learning rate: 3.539E-06 | global batch size: 16 | lm loss: 7.853518E+00 | grad norm: 1.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 676/ 128728 | consumed samples: 10816 | consumed tokens: 22151168 | elapsed time per iteration (s): 15.27 | learning rate: 3.544E-06 | global batch size: 16 | lm loss: 7.600890E+00 | grad norm: 2.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 677/ 128728 | consumed samples: 10832 | consumed tokens: 22183936 | elapsed time per iteration (s): 15.18 | learning rate: 3.549E-06 | global batch size: 16 | lm loss: 8.051760E+00 | grad norm: 1.994 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 678/ 128728 | consumed samples: 10848 | consumed tokens: 22216704 | elapsed time per iteration (s): 15.24 | learning rate: 3.555E-06 | global batch size: 16 | lm loss: 7.808934E+00 | grad norm: 1.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 679/ 128728 | consumed samples: 10864 | consumed tokens: 22249472 | elapsed time per iteration (s): 15.22 | learning rate: 3.560E-06 | global batch size: 16 | lm loss: 7.776139E+00 | grad norm: 1.302 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 680/ 128728 | consumed samples: 10880 | consumed tokens: 22282240 | elapsed time per iteration (s): 15.20 | learning rate: 3.565E-06 | global batch size: 16 | lm loss: 7.821063E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 681/ 128728 | consumed samples: 10896 | consumed tokens: 22315008 | elapsed time per iteration (s): 15.26 | learning rate: 3.570E-06 | global batch size: 16 | lm loss: 8.087934E+00 | grad norm: 2.480 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 682/ 128728 | consumed samples: 10912 | consumed tokens: 22347776 | elapsed time per iteration (s): 15.21 | learning rate: 3.576E-06 | global batch size: 16 | lm loss: 7.812382E+00 | grad norm: 1.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 683/ 128728 | consumed samples: 10928 | consumed tokens: 22380544 | elapsed time per iteration (s): 15.23 | learning rate: 3.581E-06 | global batch size: 16 | lm loss: 7.646718E+00 | grad norm: 1.346 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 684/ 128728 | consumed samples: 10944 | consumed tokens: 22413312 | elapsed time per iteration (s): 15.21 | learning rate: 3.586E-06 | global batch size: 16 | lm loss: 7.798770E+00 | grad norm: 1.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 685/ 128728 | consumed samples: 10960 | consumed tokens: 22446080 | elapsed time per iteration (s): 15.26 | learning rate: 3.591E-06 | global batch size: 16 | lm loss: 7.723039E+00 | grad norm: 1.094 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 686/ 128728 | consumed samples: 10976 | consumed tokens: 22478848 | elapsed time per iteration (s): 15.22 | learning rate: 3.597E-06 | global batch size: 16 | lm loss: 7.886545E+00 | grad norm: 1.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 687/ 128728 | consumed samples: 10992 | consumed tokens: 22511616 | elapsed time per iteration (s): 15.24 | learning rate: 3.602E-06 | global batch size: 16 | lm loss: 7.877650E+00 | grad norm: 1.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 688/ 128728 | consumed samples: 11008 | consumed tokens: 22544384 | elapsed time per iteration (s): 15.24 | learning rate: 3.607E-06 | global batch size: 16 | lm loss: 7.980981E+00 | grad norm: 1.588 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 689/ 128728 | consumed samples: 11024 | consumed tokens: 22577152 | elapsed time per iteration (s): 15.27 | 
learning rate: 3.612E-06 | global batch size: 16 | lm loss: 7.830743E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 690/ 128728 | consumed samples: 11040 | consumed tokens: 22609920 | elapsed time per iteration (s): 15.27 | learning rate: 3.618E-06 | global batch size: 16 | lm loss: 7.749845E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 691/ 128728 | consumed samples: 11056 | consumed tokens: 22642688 | elapsed time per iteration (s): 15.22 | learning rate: 3.623E-06 | global batch size: 16 | lm loss: 7.584512E+00 | grad norm: 1.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 692/ 128728 | consumed samples: 11072 | consumed tokens: 22675456 | elapsed time per iteration (s): 15.19 | learning rate: 3.628E-06 | global batch size: 16 | lm loss: 7.821875E+00 | grad norm: 1.403 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 693/ 128728 | consumed samples: 11088 | consumed tokens: 22708224 | elapsed time per iteration (s): 15.21 | learning rate: 3.633E-06 | global batch size: 16 | lm loss: 7.693274E+00 | grad norm: 1.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 694/ 128728 | consumed samples: 11104 | consumed tokens: 22740992 | elapsed time per iteration (s): 15.25 | learning rate: 3.639E-06 | global batch size: 16 | lm loss: 7.663749E+00 | grad norm: 1.613 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 695/ 128728 | 
consumed samples: 11120 | consumed tokens: 22773760 | elapsed time per iteration (s): 15.20 | learning rate: 3.644E-06 | global batch size: 16 | lm loss: 7.937167E+00 | grad norm: 2.037 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 696/ 128728 | consumed samples: 11136 | consumed tokens: 22806528 | elapsed time per iteration (s): 15.23 | learning rate: 3.649E-06 | global batch size: 16 | lm loss: 7.848682E+00 | grad norm: 2.415 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 697/ 128728 | consumed samples: 11152 | consumed tokens: 22839296 | elapsed time per iteration (s): 15.22 | learning rate: 3.654E-06 | global batch size: 16 | lm loss: 7.809802E+00 | grad norm: 1.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 698/ 128728 | consumed samples: 11168 | consumed tokens: 22872064 | elapsed time per iteration (s): 15.22 | learning rate: 3.660E-06 | global batch size: 16 | lm loss: 7.940400E+00 | grad norm: 2.311 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 699/ 128728 | consumed samples: 11184 | consumed tokens: 22904832 | elapsed time per iteration (s): 15.19 | learning rate: 3.665E-06 | global batch size: 16 | lm loss: 7.481762E+00 | grad norm: 1.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 700/ 128728 | consumed samples: 11200 | consumed tokens: 22937600 | elapsed time per iteration (s): 15.24 | learning rate: 3.670E-06 | global batch size: 16 | lm loss: 7.774322E+00 | grad norm: 1.613 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 701/ 128728 | consumed samples: 11216 | consumed tokens: 22970368 | elapsed time per iteration (s): 15.21 | learning rate: 3.675E-06 | global batch size: 16 | lm loss: 7.873240E+00 | grad norm: 1.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 702/ 128728 | consumed samples: 11232 | consumed tokens: 23003136 | elapsed time per iteration (s): 15.22 | learning rate: 3.681E-06 | global batch size: 16 | lm loss: 7.976169E+00 | grad norm: 2.095 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 703/ 128728 | consumed samples: 11248 | consumed tokens: 23035904 | elapsed time per iteration (s): 15.24 | learning rate: 3.686E-06 | global batch size: 16 | lm loss: 7.603686E+00 | grad norm: 1.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 704/ 128728 | consumed samples: 11264 | consumed tokens: 23068672 | elapsed time per iteration (s): 15.20 | learning rate: 3.691E-06 | global batch size: 16 | lm loss: 7.877244E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 705/ 128728 | consumed samples: 11280 | consumed tokens: 23101440 | elapsed time per iteration (s): 15.23 | learning rate: 3.696E-06 | global batch size: 16 | lm loss: 7.863098E+00 | grad norm: 2.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 706/ 128728 | consumed samples: 11296 | consumed tokens: 23134208 | elapsed time per iteration (s): 15.26 | learning rate: 3.701E-06 | global batch size: 16 | lm loss: 
7.755744E+00 | grad norm: 1.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 707/ 128728 | consumed samples: 11312 | consumed tokens: 23166976 | elapsed time per iteration (s): 15.19 | learning rate: 3.707E-06 | global batch size: 16 | lm loss: 7.782665E+00 | grad norm: 1.655 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 708/ 128728 | consumed samples: 11328 | consumed tokens: 23199744 | elapsed time per iteration (s): 15.22 | learning rate: 3.712E-06 | global batch size: 16 | lm loss: 7.712109E+00 | grad norm: 2.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 709/ 128728 | consumed samples: 11344 | consumed tokens: 23232512 | elapsed time per iteration (s): 15.23 | learning rate: 3.717E-06 | global batch size: 16 | lm loss: 7.964561E+00 | grad norm: 1.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 710/ 128728 | consumed samples: 11360 | consumed tokens: 23265280 | elapsed time per iteration (s): 15.23 | learning rate: 3.722E-06 | global batch size: 16 | lm loss: 7.780368E+00 | grad norm: 1.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 711/ 128728 | consumed samples: 11376 | consumed tokens: 23298048 | elapsed time per iteration (s): 15.25 | learning rate: 3.728E-06 | global batch size: 16 | lm loss: 7.641358E+00 | grad norm: 1.478 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 712/ 128728 | consumed samples: 11392 | consumed tokens: 23330816 | elapsed 
time per iteration (s): 15.22 | learning rate: 3.733E-06 | global batch size: 16 | lm loss: 7.781178E+00 | grad norm: 1.565 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 713/ 128728 | consumed samples: 11408 | consumed tokens: 23363584 | elapsed time per iteration (s): 15.22 | learning rate: 3.738E-06 | global batch size: 16 | lm loss: 7.840047E+00 | grad norm: 1.440 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 714/ 128728 | consumed samples: 11424 | consumed tokens: 23396352 | elapsed time per iteration (s): 15.21 | learning rate: 3.743E-06 | global batch size: 16 | lm loss: 8.027559E+00 | grad norm: 2.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 715/ 128728 | consumed samples: 11440 | consumed tokens: 23429120 | elapsed time per iteration (s): 15.27 | learning rate: 3.749E-06 | global batch size: 16 | lm loss: 7.754458E+00 | grad norm: 1.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 716/ 128728 | consumed samples: 11456 | consumed tokens: 23461888 | elapsed time per iteration (s): 15.22 | learning rate: 3.754E-06 | global batch size: 16 | lm loss: 7.990946E+00 | grad norm: 2.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 717/ 128728 | consumed samples: 11472 | consumed tokens: 23494656 | elapsed time per iteration (s): 15.21 | learning rate: 3.759E-06 | global batch size: 16 | lm loss: 7.646182E+00 | grad norm: 1.425 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | 
[default7]: iteration 718/ 128728 | consumed samples: 11488 | consumed tokens: 23527424 | elapsed time per iteration (s): 15.21 | learning rate: 3.764E-06 | global batch size: 16 | lm loss: 7.488737E+00 | grad norm: 1.417 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 719/ 128728 | consumed samples: 11504 | consumed tokens: 23560192 | elapsed time per iteration (s): 15.24 | learning rate: 3.770E-06 | global batch size: 16 | lm loss: 8.027329E+00 | grad norm: 2.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 720/ 128728 | consumed samples: 11520 | consumed tokens: 23592960 | elapsed time per iteration (s): 15.22 | learning rate: 3.775E-06 | global batch size: 16 | lm loss: 7.625739E+00 | grad norm: 1.588 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 721/ 128728 | consumed samples: 11536 | consumed tokens: 23625728 | elapsed time per iteration (s): 15.21 | learning rate: 3.780E-06 | global batch size: 16 | lm loss: 7.805804E+00 | grad norm: 1.370 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 722/ 128728 | consumed samples: 11552 | consumed tokens: 23658496 | elapsed time per iteration (s): 15.21 | learning rate: 3.785E-06 | global batch size: 16 | lm loss: 7.652366E+00 | grad norm: 2.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 723/ 128728 | consumed samples: 11568 | consumed tokens: 23691264 | elapsed time per iteration (s): 15.22 | learning rate: 3.791E-06 | global batch size: 16 | lm loss: 7.692093E+00 | grad norm: 2.114 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 724/ 128728 | consumed samples: 11584 | consumed tokens: 23724032 | elapsed time per iteration (s): 15.19 | learning rate: 3.796E-06 | global batch size: 16 | lm loss: 8.017685E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 725/ 128728 | consumed samples: 11600 | consumed tokens: 23756800 | elapsed time per iteration (s): 15.22 | learning rate: 3.801E-06 | global batch size: 16 | lm loss: 7.816594E+00 | grad norm: 2.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 726/ 128728 | consumed samples: 11616 | consumed tokens: 23789568 | elapsed time per iteration (s): 15.27 | learning rate: 3.806E-06 | global batch size: 16 | lm loss: 7.751608E+00 | grad norm: 1.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 727/ 128728 | consumed samples: 11632 | consumed tokens: 23822336 | elapsed time per iteration (s): 15.22 | learning rate: 3.812E-06 | global batch size: 16 | lm loss: 7.849316E+00 | grad norm: 2.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 728/ 128728 | consumed samples: 11648 | consumed tokens: 23855104 | elapsed time per iteration (s): 15.24 | learning rate: 3.817E-06 | global batch size: 16 | lm loss: 7.822732E+00 | grad norm: 2.368 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 729/ 128728 | consumed samples: 11664 | consumed tokens: 23887872 | elapsed time per iteration (s): 15.25 | learning rate: 3.822E-06 | global batch 
size: 16 | lm loss: 7.754844E+00 | grad norm: 1.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 730/ 128728 | consumed samples: 11680 | consumed tokens: 23920640 | elapsed time per iteration (s): 15.21 | learning rate: 3.827E-06 | global batch size: 16 | lm loss: 7.941967E+00 | grad norm: 2.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 731/ 128728 | consumed samples: 11696 | consumed tokens: 23953408 | elapsed time per iteration (s): 15.24 | learning rate: 3.833E-06 | global batch size: 16 | lm loss: 7.668742E+00 | grad norm: 2.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 732/ 128728 | consumed samples: 11712 | consumed tokens: 23986176 | elapsed time per iteration (s): 15.21 | learning rate: 3.838E-06 | global batch size: 16 | lm loss: 7.611368E+00 | grad norm: 1.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 733/ 128728 | consumed samples: 11728 | consumed tokens: 24018944 | elapsed time per iteration (s): 15.23 | learning rate: 3.843E-06 | global batch size: 16 | lm loss: 7.738802E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 734/ 128728 | consumed samples: 11744 | consumed tokens: 24051712 | elapsed time per iteration (s): 15.26 | learning rate: 3.848E-06 | global batch size: 16 | lm loss: 8.192348E+00 | grad norm: 2.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 735/ 128728 | consumed samples: 11760 | consumed tokens: 
24084480 | elapsed time per iteration (s): 15.25 | learning rate: 3.854E-06 | global batch size: 16 | lm loss: 7.825410E+00 | grad norm: 1.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 736/ 128728 | consumed samples: 11776 | consumed tokens: 24117248 | elapsed time per iteration (s): 15.23 | learning rate: 3.859E-06 | global batch size: 16 | lm loss: 7.624114E+00 | grad norm: 1.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 737/ 128728 | consumed samples: 11792 | consumed tokens: 24150016 | elapsed time per iteration (s): 15.21 | learning rate: 3.864E-06 | global batch size: 16 | lm loss: 7.622774E+00 | grad norm: 1.564 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 738/ 128728 | consumed samples: 11808 | consumed tokens: 24182784 | elapsed time per iteration (s): 15.20 | learning rate: 3.869E-06 | global batch size: 16 | lm loss: 7.684864E+00 | grad norm: 1.594 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 739/ 128728 | consumed samples: 11824 | consumed tokens: 24215552 | elapsed time per iteration (s): 15.21 | learning rate: 3.874E-06 | global batch size: 16 | lm loss: 7.810888E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 740/ 128728 | consumed samples: 11840 | consumed tokens: 24248320 | elapsed time per iteration (s): 15.22 | learning rate: 3.880E-06 | global batch size: 16 | lm loss: 7.660820E+00 | grad norm: 1.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | 
TFLOPs: 8.05 | [default7]: iteration 741/ 128728 | consumed samples: 11856 | consumed tokens: 24281088 | elapsed time per iteration (s): 15.25 | learning rate: 3.885E-06 | global batch size: 16 | lm loss: 7.710549E+00 | grad norm: 1.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 742/ 128728 | consumed samples: 11872 | consumed tokens: 24313856 | elapsed time per iteration (s): 15.26 | learning rate: 3.890E-06 | global batch size: 16 | lm loss: 7.604763E+00 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 743/ 128728 | consumed samples: 11888 | consumed tokens: 24346624 | elapsed time per iteration (s): 15.24 | learning rate: 3.895E-06 | global batch size: 16 | lm loss: 7.866817E+00 | grad norm: 1.443 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 744/ 128728 | consumed samples: 11904 | consumed tokens: 24379392 | elapsed time per iteration (s): 15.20 | learning rate: 3.901E-06 | global batch size: 16 | lm loss: 7.738415E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 745/ 128728 | consumed samples: 11920 | consumed tokens: 24412160 | elapsed time per iteration (s): 15.26 | learning rate: 3.906E-06 | global batch size: 16 | lm loss: 7.592759E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 746/ 128728 | consumed samples: 11936 | consumed tokens: 24444928 | elapsed time per iteration (s): 15.25 | learning rate: 3.911E-06 | global batch size: 16 | lm loss: 7.961129E+00 | grad norm: 2.540 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 747/ 128728 | consumed samples: 11952 | consumed tokens: 24477696 | elapsed time per iteration (s): 15.24 | learning rate: 3.916E-06 | global batch size: 16 | lm loss: 7.821071E+00 | grad norm: 2.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 748/ 128728 | consumed samples: 11968 | consumed tokens: 24510464 | elapsed time per iteration (s): 15.24 | learning rate: 3.922E-06 | global batch size: 16 | lm loss: 7.662223E+00 | grad norm: 2.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 749/ 128728 | consumed samples: 11984 | consumed tokens: 24543232 | elapsed time per iteration (s): 15.26 | learning rate: 3.927E-06 | global batch size: 16 | lm loss: 7.541708E+00 | grad norm: 1.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 750/ 128728 | consumed samples: 12000 | consumed tokens: 24576000 | elapsed time per iteration (s): 15.26 | learning rate: 3.932E-06 | global batch size: 16 | lm loss: 7.680205E+00 | grad norm: 2.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 751/ 128728 | consumed samples: 12016 | consumed tokens: 24608768 | elapsed time per iteration (s): 15.21 | learning rate: 3.937E-06 | global batch size: 16 | lm loss: 7.828065E+00 | grad norm: 2.392 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 752/ 128728 | consumed samples: 12032 | consumed tokens: 24641536 | elapsed time per iteration (s): 15.18 | learning rate: 
3.943E-06 | global batch size: 16 | lm loss: 7.748836E+00 | grad norm: 1.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 753/ 128728 | consumed samples: 12048 | consumed tokens: 24674304 | elapsed time per iteration (s): 15.22 | learning rate: 3.948E-06 | global batch size: 16 | lm loss: 7.542229E+00 | grad norm: 1.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 754/ 128728 | consumed samples: 12064 | consumed tokens: 24707072 | elapsed time per iteration (s): 15.27 | learning rate: 3.953E-06 | global batch size: 16 | lm loss: 7.863952E+00 | grad norm: 1.509 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 755/ 128728 | consumed samples: 12080 | consumed tokens: 24739840 | elapsed time per iteration (s): 15.22 | learning rate: 3.958E-06 | global batch size: 16 | lm loss: 7.741620E+00 | grad norm: 1.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 756/ 128728 | consumed samples: 12096 | consumed tokens: 24772608 | elapsed time per iteration (s): 15.19 | learning rate: 3.964E-06 | global batch size: 16 | lm loss: 7.805386E+00 | grad norm: 1.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 757/ 128728 | consumed samples: 12112 | consumed tokens: 24805376 | elapsed time per iteration (s): 15.24 | learning rate: 3.969E-06 | global batch size: 16 | lm loss: 7.797132E+00 | grad norm: 1.614 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 758/ 128728 | consumed samples: 
12128 | consumed tokens: 24838144 | elapsed time per iteration (s): 15.21 | learning rate: 3.974E-06 | global batch size: 16 | lm loss: 7.813857E+00 | grad norm: 1.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 759/ 128728 | consumed samples: 12144 | consumed tokens: 24870912 | elapsed time per iteration (s): 15.26 | learning rate: 3.979E-06 | global batch size: 16 | lm loss: 7.982421E+00 | grad norm: 1.430 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 760/ 128728 | consumed samples: 12160 | consumed tokens: 24903680 | elapsed time per iteration (s): 15.22 | learning rate: 3.985E-06 | global batch size: 16 | lm loss: 7.646877E+00 | grad norm: 1.612 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 761/ 128728 | consumed samples: 12176 | consumed tokens: 24936448 | elapsed time per iteration (s): 15.27 | learning rate: 3.990E-06 | global batch size: 16 | lm loss: 7.730046E+00 | grad norm: 1.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 762/ 128728 | consumed samples: 12192 | consumed tokens: 24969216 | elapsed time per iteration (s): 15.23 | learning rate: 3.995E-06 | global batch size: 16 | lm loss: 7.657454E+00 | grad norm: 1.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 763/ 128728 | consumed samples: 12208 | consumed tokens: 25001984 | elapsed time per iteration (s): 15.24 | learning rate: 4.000E-06 | global batch size: 16 | lm loss: 7.702982E+00 | grad norm: 2.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 764/ 128728 | consumed samples: 12224 | consumed tokens: 25034752 | elapsed time per iteration (s): 15.21 | learning rate: 4.006E-06 | global batch size: 16 | lm loss: 7.730881E+00 | grad norm: 1.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 765/ 128728 | consumed samples: 12240 | consumed tokens: 25067520 | elapsed time per iteration (s): 15.23 | learning rate: 4.011E-06 | global batch size: 16 | lm loss: 7.581010E+00 | grad norm: 1.566 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 766/ 128728 | consumed samples: 12256 | consumed tokens: 25100288 | elapsed time per iteration (s): 15.26 | learning rate: 4.016E-06 | global batch size: 16 | lm loss: 7.640491E+00 | grad norm: 1.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 767/ 128728 | consumed samples: 12272 | consumed tokens: 25133056 | elapsed time per iteration (s): 15.20 | learning rate: 4.021E-06 | global batch size: 16 | lm loss: 7.722907E+00 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 768/ 128728 | consumed samples: 12288 | consumed tokens: 25165824 | elapsed time per iteration (s): 15.24 | learning rate: 4.027E-06 | global batch size: 16 | lm loss: 7.654436E+00 | grad norm: 1.597 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 769/ 128728 | consumed samples: 12304 | consumed tokens: 25198592 | elapsed time per iteration (s): 15.83 | learning rate: 4.032E-06 | global batch size: 16 | lm loss: 7.346375E+00 | grad norm: 1.381 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.011 | TFLOPs: 7.74 | [default7]: iteration 770/ 128728 | consumed samples: 12320 | consumed tokens: 25231360 | elapsed time per iteration (s): 15.11 | learning rate: 4.037E-06 | global batch size: 16 | lm loss: 7.761549E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.059 | TFLOPs: 8.11 | [default7]: iteration 771/ 128728 | consumed samples: 12336 | consumed tokens: 25264128 | elapsed time per iteration (s): 15.03 | learning rate: 4.042E-06 | global batch size: 16 | lm loss: 7.792974E+00 | grad norm: 1.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.065 | TFLOPs: 8.15 | [default7]: iteration 772/ 128728 | consumed samples: 12352 | consumed tokens: 25296896 | elapsed time per iteration (s): 15.04 | learning rate: 4.048E-06 | global batch size: 16 | lm loss: 7.731169E+00 | grad norm: 1.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.064 | TFLOPs: 8.14 | [default7]: iteration 773/ 128728 | consumed samples: 12368 | consumed tokens: 25329664 | elapsed time per iteration (s): 15.05 | learning rate: 4.053E-06 | global batch size: 16 | lm loss: 7.725765E+00 | grad norm: 1.304 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.063 | TFLOPs: 8.14 | [default7]: iteration 774/ 128728 | consumed samples: 12384 | consumed tokens: 25362432 | elapsed time per iteration (s): 15.11 | learning rate: 4.058E-06 | global batch size: 16 | lm loss: 7.766714E+00 | grad norm: 1.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.059 | TFLOPs: 8.11 | [default7]: iteration 775/ 128728 | consumed samples: 12400 | consumed tokens: 25395200 | elapsed time per iteration (s): 15.10 | 
learning rate: 4.063E-06 | global batch size: 16 | lm loss: 7.545648E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.060 | TFLOPs: 8.11 | [default7]: iteration 776/ 128728 | consumed samples: 12416 | consumed tokens: 25427968 | elapsed time per iteration (s): 15.06 | learning rate: 4.068E-06 | global batch size: 16 | lm loss: 7.570961E+00 | grad norm: 1.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.062 | TFLOPs: 8.13 | [default7]: iteration 777/ 128728 | consumed samples: 12432 | consumed tokens: 25460736 | elapsed time per iteration (s): 15.12 | learning rate: 4.074E-06 | global batch size: 16 | lm loss: 7.759934E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 778/ 128728 | consumed samples: 12448 | consumed tokens: 25493504 | elapsed time per iteration (s): 15.07 | learning rate: 4.079E-06 | global batch size: 16 | lm loss: 7.718737E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.062 | TFLOPs: 8.13 | [default7]: iteration 779/ 128728 | consumed samples: 12464 | consumed tokens: 25526272 | elapsed time per iteration (s): 15.05 | learning rate: 4.084E-06 | global batch size: 16 | lm loss: 7.721785E+00 | grad norm: 1.368 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.063 | TFLOPs: 8.14 | [default7]: iteration 780/ 128728 | consumed samples: 12480 | consumed tokens: 25559040 | elapsed time per iteration (s): 15.07 | learning rate: 4.089E-06 | global batch size: 16 | lm loss: 7.713555E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.061 | TFLOPs: 8.13 | [default7]: iteration 781/ 128728 | 
consumed samples: 12496 | consumed tokens: 25591808 | elapsed time per iteration (s): 14.98 | learning rate: 4.095E-06 | global batch size: 16 | lm loss: 7.670259E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.068 | TFLOPs: 8.18 | [default7]: iteration 782/ 128728 | consumed samples: 12512 | consumed tokens: 25624576 | elapsed time per iteration (s): 15.03 | learning rate: 4.100E-06 | global batch size: 16 | lm loss: 7.595325E+00 | grad norm: 1.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.064 | TFLOPs: 8.15 | [default7]: iteration 783/ 128728 | consumed samples: 12528 | consumed tokens: 25657344 | elapsed time per iteration (s): 15.13 | learning rate: 4.105E-06 | global batch size: 16 | lm loss: 7.531812E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 784/ 128728 | consumed samples: 12544 | consumed tokens: 25690112 | elapsed time per iteration (s): 15.07 | learning rate: 4.110E-06 | global batch size: 16 | lm loss: 7.455743E+00 | grad norm: 1.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.062 | TFLOPs: 8.13 | [default7]: iteration 785/ 128728 | consumed samples: 12560 | consumed tokens: 25722880 | elapsed time per iteration (s): 15.05 | learning rate: 4.116E-06 | global batch size: 16 | lm loss: 7.543957E+00 | grad norm: 1.343 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.063 | TFLOPs: 8.14 | [default7]: iteration 786/ 128728 | consumed samples: 12576 | consumed tokens: 25755648 | elapsed time per iteration (s): 15.07 | learning rate: 4.121E-06 | global batch size: 16 | lm loss: 7.538573E+00 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.062 | TFLOPs: 8.13 | [default7]: iteration 787/ 128728 | consumed samples: 12592 | consumed tokens: 25788416 | elapsed time per iteration (s): 15.05 | learning rate: 4.126E-06 | global batch size: 16 | lm loss: 7.473088E+00 | grad norm: 1.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.063 | TFLOPs: 8.14 | [default7]: iteration 788/ 128728 | consumed samples: 12608 | consumed tokens: 25821184 | elapsed time per iteration (s): 15.04 | learning rate: 4.131E-06 | global batch size: 16 | lm loss: 7.735221E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.064 | TFLOPs: 8.15 | [default7]: iteration 789/ 128728 | consumed samples: 12624 | consumed tokens: 25853952 | elapsed time per iteration (s): 15.08 | learning rate: 4.137E-06 | global batch size: 16 | lm loss: 7.478712E+00 | grad norm: 1.605 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.061 | TFLOPs: 8.12 | [default7]: iteration 790/ 128728 | consumed samples: 12640 | consumed tokens: 25886720 | elapsed time per iteration (s): 15.02 | learning rate: 4.142E-06 | global batch size: 16 | lm loss: 7.608325E+00 | grad norm: 1.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.065 | TFLOPs: 8.16 | [default7]: iteration 791/ 128728 | consumed samples: 12656 | consumed tokens: 25919488 | elapsed time per iteration (s): 15.06 | learning rate: 4.147E-06 | global batch size: 16 | lm loss: 7.450841E+00 | grad norm: 1.401 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.062 | TFLOPs: 8.13 | [default7]: iteration 792/ 128728 | consumed samples: 12672 | consumed tokens: 25952256 | elapsed time per iteration (s): 15.03 | learning rate: 4.152E-06 | global batch size: 16 | lm loss: 
7.622550E+00 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.065 | TFLOPs: 8.15 | [default7]: iteration 793/ 128728 | consumed samples: 12688 | consumed tokens: 25985024 | elapsed time per iteration (s): 15.13 | learning rate: 4.158E-06 | global batch size: 16 | lm loss: 7.475448E+00 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 794/ 128728 | consumed samples: 12704 | consumed tokens: 26017792 | elapsed time per iteration (s): 15.07 | learning rate: 4.163E-06 | global batch size: 16 | lm loss: 7.738382E+00 | grad norm: 1.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.062 | TFLOPs: 8.13 | [default7]: iteration 795/ 128728 | consumed samples: 12720 | consumed tokens: 26050560 | elapsed time per iteration (s): 14.97 | learning rate: 4.168E-06 | global batch size: 16 | lm loss: 7.791917E+00 | grad norm: 1.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.069 | TFLOPs: 8.18 | [default7]: iteration 796/ 128728 | consumed samples: 12736 | consumed tokens: 26083328 | elapsed time per iteration (s): 14.84 | learning rate: 4.173E-06 | global batch size: 16 | lm loss: 7.620174E+00 | grad norm: 1.641 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.078 | TFLOPs: 8.25 | [default7]: iteration 797/ 128728 | consumed samples: 12752 | consumed tokens: 26116096 | elapsed time per iteration (s): 15.25 | learning rate: 4.179E-06 | global batch size: 16 | lm loss: 7.478716E+00 | grad norm: 1.076 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 798/ 128728 | consumed samples: 12768 | consumed tokens: 26148864 | elapsed 
time per iteration (s): 15.23 | learning rate: 4.184E-06 | global batch size: 16 | lm loss: 7.694183E+00 | grad norm: 2.498 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 799/ 128728 | consumed samples: 12784 | consumed tokens: 26181632 | elapsed time per iteration (s): 15.22 | learning rate: 4.189E-06 | global batch size: 16 | lm loss: 7.490611E+00 | grad norm: 1.571 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 800/ 128728 | consumed samples: 12800 | consumed tokens: 26214400 | elapsed time per iteration (s): 15.24 | learning rate: 4.194E-06 | global batch size: 16 | lm loss: 7.733963E+00 | grad norm: 2.020 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 801/ 128728 | consumed samples: 12816 | consumed tokens: 26247168 | elapsed time per iteration (s): 15.26 | learning rate: 4.200E-06 | global batch size: 16 | lm loss: 7.516152E+00 | grad norm: 1.338 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 802/ 128728 | consumed samples: 12832 | consumed tokens: 26279936 | elapsed time per iteration (s): 15.23 | learning rate: 4.205E-06 | global batch size: 16 | lm loss: 7.613828E+00 | grad norm: 2.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 803/ 128728 | consumed samples: 12848 | consumed tokens: 26312704 | elapsed time per iteration (s): 15.24 | learning rate: 4.210E-06 | global batch size: 16 | lm loss: 7.903152E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 804/ 128728 | consumed samples: 12864 | consumed tokens: 26345472 | elapsed time per iteration (s): 15.30 | learning rate: 4.215E-06 | global batch size: 16 | lm loss: 7.665509E+00 | grad norm: 1.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 805/ 128728 | consumed samples: 12880 | consumed tokens: 26378240 | elapsed time per iteration (s): 15.25 | learning rate: 4.221E-06 | global batch size: 16 | lm loss: 7.686241E+00 | grad norm: 1.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 806/ 128728 | consumed samples: 12896 | consumed tokens: 26411008 | elapsed time per iteration (s): 15.24 | learning rate: 4.226E-06 | global batch size: 16 | lm loss: 7.861027E+00 | grad norm: 1.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 807/ 128728 | consumed samples: 12912 | consumed tokens: 26443776 | elapsed time per iteration (s): 15.24 | learning rate: 4.231E-06 | global batch size: 16 | lm loss: 7.592918E+00 | grad norm: 1.554 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 808/ 128728 | consumed samples: 12928 | consumed tokens: 26476544 | elapsed time per iteration (s): 15.21 | learning rate: 4.236E-06 | global batch size: 16 | lm loss: 7.650827E+00 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 809/ 128728 | consumed samples: 12944 | consumed tokens: 26509312 | elapsed time per iteration (s): 15.22 | learning rate: 4.242E-06 | global batch size: 16 | lm loss: 7.584604E+00 | grad norm: 1.414 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 810/ 128728 | consumed samples: 12960 | consumed tokens: 26542080 | elapsed time per iteration (s): 15.23 | learning rate: 4.247E-06 | global batch size: 16 | lm loss: 7.401367E+00 | grad norm: 1.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 811/ 128728 | consumed samples: 12976 | consumed tokens: 26574848 | elapsed time per iteration (s): 15.27 | learning rate: 4.252E-06 | global batch size: 16 | lm loss: 7.733647E+00 | grad norm: 1.621 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 812/ 128728 | consumed samples: 12992 | consumed tokens: 26607616 | elapsed time per iteration (s): 15.25 | learning rate: 4.257E-06 | global batch size: 16 | lm loss: 7.667072E+00 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 813/ 128728 | consumed samples: 13008 | consumed tokens: 26640384 | elapsed time per iteration (s): 15.23 | learning rate: 4.262E-06 | global batch size: 16 | lm loss: 7.803669E+00 | grad norm: 1.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 814/ 128728 | consumed samples: 13024 | consumed tokens: 26673152 | elapsed time per iteration (s): 15.20 | learning rate: 4.268E-06 | global batch size: 16 | lm loss: 7.590942E+00 | grad norm: 1.495 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 815/ 128728 | consumed samples: 13040 | consumed tokens: 26705920 | elapsed time per iteration (s): 15.18 | learning rate: 4.273E-06 | global batch 
size: 16 | lm loss: 7.517165E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 816/ 128728 | consumed samples: 13056 | consumed tokens: 26738688 | elapsed time per iteration (s): 15.21 | learning rate: 4.278E-06 | global batch size: 16 | lm loss: 7.709677E+00 | grad norm: 1.526 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 817/ 128728 | consumed samples: 13072 | consumed tokens: 26771456 | elapsed time per iteration (s): 15.23 | learning rate: 4.283E-06 | global batch size: 16 | lm loss: 7.403444E+00 | grad norm: 1.416 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 818/ 128728 | consumed samples: 13088 | consumed tokens: 26804224 | elapsed time per iteration (s): 15.17 | learning rate: 4.289E-06 | global batch size: 16 | lm loss: 8.024632E+00 | grad norm: 1.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 819/ 128728 | consumed samples: 13104 | consumed tokens: 26836992 | elapsed time per iteration (s): 15.20 | learning rate: 4.294E-06 | global batch size: 16 | lm loss: 7.400269E+00 | grad norm: 2.521 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 820/ 128728 | consumed samples: 13120 | consumed tokens: 26869760 | elapsed time per iteration (s): 15.23 | learning rate: 4.299E-06 | global batch size: 16 | lm loss: 7.701864E+00 | grad norm: 1.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 821/ 128728 | consumed samples: 13136 | consumed tokens: 
26902528 | elapsed time per iteration (s): 15.24 | learning rate: 4.304E-06 | global batch size: 16 | lm loss: 7.637981E+00 | grad norm: 1.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 822/ 128728 | consumed samples: 13152 | consumed tokens: 26935296 | elapsed time per iteration (s): 15.23 | learning rate: 4.310E-06 | global batch size: 16 | lm loss: 7.643306E+00 | grad norm: 2.043 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 823/ 128728 | consumed samples: 13168 | consumed tokens: 26968064 | elapsed time per iteration (s): 15.21 | learning rate: 4.315E-06 | global batch size: 16 | lm loss: 7.696064E+00 | grad norm: 2.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 824/ 128728 | consumed samples: 13184 | consumed tokens: 27000832 | elapsed time per iteration (s): 15.21 | learning rate: 4.320E-06 | global batch size: 16 | lm loss: 7.558313E+00 | grad norm: 1.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 825/ 128728 | consumed samples: 13200 | consumed tokens: 27033600 | elapsed time per iteration (s): 15.24 | learning rate: 4.325E-06 | global batch size: 16 | lm loss: 7.669714E+00 | grad norm: 2.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 826/ 128728 | consumed samples: 13216 | consumed tokens: 27066368 | elapsed time per iteration (s): 15.22 | learning rate: 4.331E-06 | global batch size: 16 | lm loss: 7.444411E+00 | grad norm: 1.357 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | 
TFLOPs: 8.05 | [default7]: iteration 827/ 128728 | consumed samples: 13232 | consumed tokens: 27099136 | elapsed time per iteration (s): 15.22 | learning rate: 4.336E-06 | global batch size: 16 | lm loss: 7.498442E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 828/ 128728 | consumed samples: 13248 | consumed tokens: 27131904 | elapsed time per iteration (s): 15.22 | learning rate: 4.341E-06 | global batch size: 16 | lm loss: 7.616692E+00 | grad norm: 1.572 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 829/ 128728 | consumed samples: 13264 | consumed tokens: 27164672 | elapsed time per iteration (s): 15.16 | learning rate: 4.346E-06 | global batch size: 16 | lm loss: 7.807779E+00 | grad norm: 1.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 830/ 128728 | consumed samples: 13280 | consumed tokens: 27197440 | elapsed time per iteration (s): 15.17 | learning rate: 4.352E-06 | global batch size: 16 | lm loss: 7.562619E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 831/ 128728 | consumed samples: 13296 | consumed tokens: 27230208 | elapsed time per iteration (s): 15.24 | learning rate: 4.357E-06 | global batch size: 16 | lm loss: 7.482844E+00 | grad norm: 3.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 832/ 128728 | consumed samples: 13312 | consumed tokens: 27262976 | elapsed time per iteration (s): 15.26 | learning rate: 4.362E-06 | global batch size: 16 | lm loss: 7.744891E+00 | grad norm: 1.439 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 833/ 128728 | consumed samples: 13328 | consumed tokens: 27295744 | elapsed time per iteration (s): 15.25 | learning rate: 4.367E-06 | global batch size: 16 | lm loss: 7.532849E+00 | grad norm: 1.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 834/ 128728 | consumed samples: 13344 | consumed tokens: 27328512 | elapsed time per iteration (s): 15.24 | learning rate: 4.373E-06 | global batch size: 16 | lm loss: 7.463634E+00 | grad norm: 1.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 835/ 128728 | consumed samples: 13360 | consumed tokens: 27361280 | elapsed time per iteration (s): 15.22 | learning rate: 4.378E-06 | global batch size: 16 | lm loss: 7.629139E+00 | grad norm: 1.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 836/ 128728 | consumed samples: 13376 | consumed tokens: 27394048 | elapsed time per iteration (s): 15.23 | learning rate: 4.383E-06 | global batch size: 16 | lm loss: 7.463190E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 837/ 128728 | consumed samples: 13392 | consumed tokens: 27426816 | elapsed time per iteration (s): 15.27 | learning rate: 4.388E-06 | global batch size: 16 | lm loss: 7.357310E+00 | grad norm: 1.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 838/ 128728 | consumed samples: 13408 | consumed tokens: 27459584 | elapsed time per iteration (s): 15.27 | learning rate: 
4.394E-06 | global batch size: 16 | lm loss: 7.757633E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 839/ 128728 | consumed samples: 13424 | consumed tokens: 27492352 | elapsed time per iteration (s): 15.27 | learning rate: 4.399E-06 | global batch size: 16 | lm loss: 7.545015E+00 | grad norm: 1.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 840/ 128728 | consumed samples: 13440 | consumed tokens: 27525120 | elapsed time per iteration (s): 15.25 | learning rate: 4.404E-06 | global batch size: 16 | lm loss: 7.411932E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 841/ 128728 | consumed samples: 13456 | consumed tokens: 27557888 | elapsed time per iteration (s): 15.25 | learning rate: 4.409E-06 | global batch size: 16 | lm loss: 7.422668E+00 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 842/ 128728 | consumed samples: 13472 | consumed tokens: 27590656 | elapsed time per iteration (s): 15.18 | learning rate: 4.415E-06 | global batch size: 16 | lm loss: 7.665534E+00 | grad norm: 1.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 843/ 128728 | consumed samples: 13488 | consumed tokens: 27623424 | elapsed time per iteration (s): 15.25 | learning rate: 4.420E-06 | global batch size: 16 | lm loss: 7.618068E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 844/ 128728 | consumed samples: 
13504 | consumed tokens: 27656192 | elapsed time per iteration (s): 15.18 | learning rate: 4.425E-06 | global batch size: 16 | lm loss: 7.596480E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 845/ 128728 | consumed samples: 13520 | consumed tokens: 27688960 | elapsed time per iteration (s): 15.17 | learning rate: 4.430E-06 | global batch size: 16 | lm loss: 7.562824E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 846/ 128728 | consumed samples: 13536 | consumed tokens: 27721728 | elapsed time per iteration (s): 15.26 | learning rate: 4.435E-06 | global batch size: 16 | lm loss: 7.561560E+00 | grad norm: 1.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 847/ 128728 | consumed samples: 13552 | consumed tokens: 27754496 | elapsed time per iteration (s): 15.22 | learning rate: 4.441E-06 | global batch size: 16 | lm loss: 7.958152E+00 | grad norm: 1.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 848/ 128728 | consumed samples: 13568 | consumed tokens: 27787264 | elapsed time per iteration (s): 15.18 | learning rate: 4.446E-06 | global batch size: 16 | lm loss: 7.501763E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 849/ 128728 | consumed samples: 13584 | consumed tokens: 27820032 | elapsed time per iteration (s): 15.26 | learning rate: 4.451E-06 | global batch size: 16 | lm loss: 7.435274E+00 | grad norm: 1.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 850/ 128728 | consumed samples: 13600 | consumed tokens: 27852800 | elapsed time per iteration (s): 15.26 | learning rate: 4.456E-06 | global batch size: 16 | lm loss: 7.425239E+00 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 851/ 128728 | consumed samples: 13616 | consumed tokens: 27885568 | elapsed time per iteration (s): 15.26 | learning rate: 4.462E-06 | global batch size: 16 | lm loss: 7.559560E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 852/ 128728 | consumed samples: 13632 | consumed tokens: 27918336 | elapsed time per iteration (s): 15.24 | learning rate: 4.467E-06 | global batch size: 16 | lm loss: 7.470264E+00 | grad norm: 1.636 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 853/ 128728 | consumed samples: 13648 | consumed tokens: 27951104 | elapsed time per iteration (s): 15.18 | learning rate: 4.472E-06 | global batch size: 16 | lm loss: 7.504191E+00 | grad norm: 1.602 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 854/ 128728 | consumed samples: 13664 | consumed tokens: 27983872 | elapsed time per iteration (s): 15.23 | learning rate: 4.477E-06 | global batch size: 16 | lm loss: 7.452326E+00 | grad norm: 1.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 855/ 128728 | consumed samples: 13680 | consumed tokens: 28016640 | elapsed time per iteration (s): 15.18 | learning rate: 4.483E-06 | global batch size: 16 | lm loss: 7.583494E+00 | grad norm: 1.291 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 856/ 128728 | consumed samples: 13696 | consumed tokens: 28049408 | elapsed time per iteration (s): 15.25 | learning rate: 4.488E-06 | global batch size: 16 | lm loss: 7.333179E+00 | grad norm: 1.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 857/ 128728 | consumed samples: 13712 | consumed tokens: 28082176 | elapsed time per iteration (s): 15.24 | learning rate: 4.493E-06 | global batch size: 16 | lm loss: 7.519557E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 858/ 128728 | consumed samples: 13728 | consumed tokens: 28114944 | elapsed time per iteration (s): 15.25 | learning rate: 4.498E-06 | global batch size: 16 | lm loss: 7.641896E+00 | grad norm: 1.367 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 859/ 128728 | consumed samples: 13744 | consumed tokens: 28147712 | elapsed time per iteration (s): 15.17 | learning rate: 4.504E-06 | global batch size: 16 | lm loss: 7.602086E+00 | grad norm: 1.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 860/ 128728 | consumed samples: 13760 | consumed tokens: 28180480 | elapsed time per iteration (s): 15.19 | learning rate: 4.509E-06 | global batch size: 16 | lm loss: 7.520714E+00 | grad norm: 1.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 861/ 128728 | consumed samples: 13776 | consumed tokens: 28213248 | elapsed time per iteration (s): 15.22 | 
learning rate: 4.514E-06 | global batch size: 16 | lm loss: 7.511874E+00 | grad norm: 1.317 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 862/ 128728 | consumed samples: 13792 | consumed tokens: 28246016 | elapsed time per iteration (s): 15.16 | learning rate: 4.519E-06 | global batch size: 16 | lm loss: 7.545038E+00 | grad norm: 1.074 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 863/ 128728 | consumed samples: 13808 | consumed tokens: 28278784 | elapsed time per iteration (s): 15.21 | learning rate: 4.525E-06 | global batch size: 16 | lm loss: 7.392710E+00 | grad norm: 1.944 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 864/ 128728 | consumed samples: 13824 | consumed tokens: 28311552 | elapsed time per iteration (s): 15.26 | learning rate: 4.530E-06 | global batch size: 16 | lm loss: 7.715175E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 865/ 128728 | consumed samples: 13840 | consumed tokens: 28344320 | elapsed time per iteration (s): 15.22 | learning rate: 4.535E-06 | global batch size: 16 | lm loss: 7.498834E+00 | grad norm: 1.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 866/ 128728 | consumed samples: 13856 | consumed tokens: 28377088 | elapsed time per iteration (s): 15.25 | learning rate: 4.540E-06 | global batch size: 16 | lm loss: 7.556900E+00 | grad norm: 1.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 867/ 128728 | 
consumed samples: 13872 | consumed tokens: 28409856 | elapsed time per iteration (s): 15.21 | learning rate: 4.546E-06 | global batch size: 16 | lm loss: 7.598176E+00 | grad norm: 1.360 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 868/ 128728 | consumed samples: 13888 | consumed tokens: 28442624 | elapsed time per iteration (s): 15.24 | learning rate: 4.551E-06 | global batch size: 16 | lm loss: 7.491490E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 869/ 128728 | consumed samples: 13904 | consumed tokens: 28475392 | elapsed time per iteration (s): 15.26 | learning rate: 4.556E-06 | global batch size: 16 | lm loss: 7.520513E+00 | grad norm: 1.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 870/ 128728 | consumed samples: 13920 | consumed tokens: 28508160 | elapsed time per iteration (s): 15.23 | learning rate: 4.561E-06 | global batch size: 16 | lm loss: 7.169995E+00 | grad norm: 1.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 871/ 128728 | consumed samples: 13936 | consumed tokens: 28540928 | elapsed time per iteration (s): 15.20 | learning rate: 4.567E-06 | global batch size: 16 | lm loss: 7.613565E+00 | grad norm: 1.412 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 872/ 128728 | consumed samples: 13952 | consumed tokens: 28573696 | elapsed time per iteration (s): 15.23 | learning rate: 4.572E-06 | global batch size: 16 | lm loss: 7.603791E+00 | grad norm: 1.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 873/ 128728 | consumed samples: 13968 | consumed tokens: 28606464 | elapsed time per iteration (s): 15.26 | learning rate: 4.577E-06 | global batch size: 16 | lm loss: 7.504703E+00 | grad norm: 2.425 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 874/ 128728 | consumed samples: 13984 | consumed tokens: 28639232 | elapsed time per iteration (s): 15.25 | learning rate: 4.582E-06 | global batch size: 16 | lm loss: 7.594444E+00 | grad norm: 1.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 875/ 128728 | consumed samples: 14000 | consumed tokens: 28672000 | elapsed time per iteration (s): 15.19 | learning rate: 4.588E-06 | global batch size: 16 | lm loss: 7.600210E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 876/ 128728 | consumed samples: 14016 | consumed tokens: 28704768 | elapsed time per iteration (s): 15.22 | learning rate: 4.593E-06 | global batch size: 16 | lm loss: 7.522717E+00 | grad norm: 1.620 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 877/ 128728 | consumed samples: 14032 | consumed tokens: 28737536 | elapsed time per iteration (s): 15.26 | learning rate: 4.598E-06 | global batch size: 16 | lm loss: 7.450993E+00 | grad norm: 1.517 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 878/ 128728 | consumed samples: 14048 | consumed tokens: 28770304 | elapsed time per iteration (s): 15.26 | learning rate: 4.603E-06 | global batch size: 16 | lm loss: 
7.297291E+00 | grad norm: 1.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 879/ 128728 | consumed samples: 14064 | consumed tokens: 28803072 | elapsed time per iteration (s): 15.22 | learning rate: 4.609E-06 | global batch size: 16 | lm loss: 7.489501E+00 | grad norm: 2.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 880/ 128728 | consumed samples: 14080 | consumed tokens: 28835840 | elapsed time per iteration (s): 15.24 | learning rate: 4.614E-06 | global batch size: 16 | lm loss: 7.403663E+00 | grad norm: 1.527 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 881/ 128728 | consumed samples: 14096 | consumed tokens: 28868608 | elapsed time per iteration (s): 15.23 | learning rate: 4.619E-06 | global batch size: 16 | lm loss: 7.537346E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 882/ 128728 | consumed samples: 14112 | consumed tokens: 28901376 | elapsed time per iteration (s): 15.23 | learning rate: 4.624E-06 | global batch size: 16 | lm loss: 7.363647E+00 | grad norm: 1.856 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 883/ 128728 | consumed samples: 14128 | consumed tokens: 28934144 | elapsed time per iteration (s): 15.24 | learning rate: 4.629E-06 | global batch size: 16 | lm loss: 7.634407E+00 | grad norm: 1.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 884/ 128728 | consumed samples: 14144 | consumed tokens: 28966912 | elapsed 
time per iteration (s): 15.22 | learning rate: 4.635E-06 | global batch size: 16 | lm loss: 7.377182E+00 | grad norm: 1.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 885/ 128728 | consumed samples: 14160 | consumed tokens: 28999680 | elapsed time per iteration (s): 15.24 | learning rate: 4.640E-06 | global batch size: 16 | lm loss: 7.484207E+00 | grad norm: 1.435 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 886/ 128728 | consumed samples: 14176 | consumed tokens: 29032448 | elapsed time per iteration (s): 15.23 | learning rate: 4.645E-06 | global batch size: 16 | lm loss: 7.508356E+00 | grad norm: 1.357 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 887/ 128728 | consumed samples: 14192 | consumed tokens: 29065216 | elapsed time per iteration (s): 15.24 | learning rate: 4.650E-06 | global batch size: 16 | lm loss: 7.583908E+00 | grad norm: 1.316 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 888/ 128728 | consumed samples: 14208 | consumed tokens: 29097984 | elapsed time per iteration (s): 15.25 | learning rate: 4.656E-06 | global batch size: 16 | lm loss: 7.400177E+00 | grad norm: 1.628 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 889/ 128728 | consumed samples: 14224 | consumed tokens: 29130752 | elapsed time per iteration (s): 15.23 | learning rate: 4.661E-06 | global batch size: 16 | lm loss: 7.434398E+00 | grad norm: 1.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | 
[default7]: iteration 890/ 128728 | consumed samples: 14240 | consumed tokens: 29163520 | elapsed time per iteration (s): 15.23 | learning rate: 4.666E-06 | global batch size: 16 | lm loss: 7.919844E+00 | grad norm: 1.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 891/ 128728 | consumed samples: 14256 | consumed tokens: 29196288 | elapsed time per iteration (s): 15.19 | learning rate: 4.671E-06 | global batch size: 16 | lm loss: 7.375011E+00 | grad norm: 2.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 892/ 128728 | consumed samples: 14272 | consumed tokens: 29229056 | elapsed time per iteration (s): 15.30 | learning rate: 4.677E-06 | global batch size: 16 | lm loss: 7.455361E+00 | grad norm: 1.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 893/ 128728 | consumed samples: 14288 | consumed tokens: 29261824 | elapsed time per iteration (s): 15.25 | learning rate: 4.682E-06 | global batch size: 16 | lm loss: 7.363049E+00 | grad norm: 1.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 894/ 128728 | consumed samples: 14304 | consumed tokens: 29294592 | elapsed time per iteration (s): 15.22 | learning rate: 4.687E-06 | global batch size: 16 | lm loss: 7.459336E+00 | grad norm: 1.293 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 895/ 128728 | consumed samples: 14320 | consumed tokens: 29327360 | elapsed time per iteration (s): 15.26 | learning rate: 4.692E-06 | global batch size: 16 | lm loss: 7.505486E+00 | grad norm: 1.498 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 896/ 128728 | consumed samples: 14336 | consumed tokens: 29360128 | elapsed time per iteration (s): 15.21 | learning rate: 4.698E-06 | global batch size: 16 | lm loss: 7.412171E+00 | grad norm: 1.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 897/ 128728 | consumed samples: 14352 | consumed tokens: 29392896 | elapsed time per iteration (s): 15.29 | learning rate: 4.703E-06 | global batch size: 16 | lm loss: 7.677485E+00 | grad norm: 2.570 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 898/ 128728 | consumed samples: 14368 | consumed tokens: 29425664 | elapsed time per iteration (s): 15.31 | learning rate: 4.708E-06 | global batch size: 16 | lm loss: 7.416935E+00 | grad norm: 1.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 899/ 128728 | consumed samples: 14384 | consumed tokens: 29458432 | elapsed time per iteration (s): 15.17 | learning rate: 4.713E-06 | global batch size: 16 | lm loss: 7.279807E+00 | grad norm: 2.479 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 900/ 128728 | consumed samples: 14400 | consumed tokens: 29491200 | elapsed time per iteration (s): 15.22 | learning rate: 4.719E-06 | global batch size: 16 | lm loss: 7.462852E+00 | grad norm: 1.378 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 901/ 128728 | consumed samples: 14416 | consumed tokens: 29523968 | elapsed time per iteration (s): 15.30 | learning rate: 4.724E-06 | global batch 
size: 16 | lm loss: 7.639120E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 902/ 128728 | consumed samples: 14432 | consumed tokens: 29556736 | elapsed time per iteration (s): 15.25 | learning rate: 4.729E-06 | global batch size: 16 | lm loss: 7.405077E+00 | grad norm: 1.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 903/ 128728 | consumed samples: 14448 | consumed tokens: 29589504 | elapsed time per iteration (s): 15.28 | learning rate: 4.734E-06 | global batch size: 16 | lm loss: 7.423763E+00 | grad norm: 1.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 904/ 128728 | consumed samples: 14464 | consumed tokens: 29622272 | elapsed time per iteration (s): 15.26 | learning rate: 4.740E-06 | global batch size: 16 | lm loss: 7.548100E+00 | grad norm: 2.421 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 905/ 128728 | consumed samples: 14480 | consumed tokens: 29655040 | elapsed time per iteration (s): 15.22 | learning rate: 4.745E-06 | global batch size: 16 | lm loss: 7.505497E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 906/ 128728 | consumed samples: 14496 | consumed tokens: 29687808 | elapsed time per iteration (s): 15.22 | learning rate: 4.750E-06 | global batch size: 16 | lm loss: 7.657626E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 907/ 128728 | consumed samples: 14512 | consumed tokens: 
29720576 | elapsed time per iteration (s): 15.27 | learning rate: 4.755E-06 | global batch size: 16 | lm loss: 7.370772E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 908/ 128728 | consumed samples: 14528 | consumed tokens: 29753344 | elapsed time per iteration (s): 15.24 | learning rate: 4.761E-06 | global batch size: 16 | lm loss: 7.308388E+00 | grad norm: 1.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 909/ 128728 | consumed samples: 14544 | consumed tokens: 29786112 | elapsed time per iteration (s): 15.18 | learning rate: 4.766E-06 | global batch size: 16 | lm loss: 7.730386E+00 | grad norm: 1.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 910/ 128728 | consumed samples: 14560 | consumed tokens: 29818880 | elapsed time per iteration (s): 15.27 | learning rate: 4.771E-06 | global batch size: 16 | lm loss: 7.448133E+00 | grad norm: 1.317 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 911/ 128728 | consumed samples: 14576 | consumed tokens: 29851648 | elapsed time per iteration (s): 15.25 | learning rate: 4.776E-06 | global batch size: 16 | lm loss: 7.687496E+00 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 912/ 128728 | consumed samples: 14592 | consumed tokens: 29884416 | elapsed time per iteration (s): 15.22 | learning rate: 4.782E-06 | global batch size: 16 | lm loss: 7.360633E+00 | grad norm: 1.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | 
TFLOPs: 8.05 | [default7]: iteration 913/ 128728 | consumed samples: 14608 | consumed tokens: 29917184 | elapsed time per iteration (s): 15.25 | learning rate: 4.787E-06 | global batch size: 16 | lm loss: 7.608915E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 914/ 128728 | consumed samples: 14624 | consumed tokens: 29949952 | elapsed time per iteration (s): 15.23 | learning rate: 4.792E-06 | global batch size: 16 | lm loss: 7.448811E+00 | grad norm: 1.322 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 915/ 128728 | consumed samples: 14640 | consumed tokens: 29982720 | elapsed time per iteration (s): 15.21 | learning rate: 4.797E-06 | global batch size: 16 | lm loss: 7.706942E+00 | grad norm: 1.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 916/ 128728 | consumed samples: 14656 | consumed tokens: 30015488 | elapsed time per iteration (s): 15.24 | learning rate: 4.802E-06 | global batch size: 16 | lm loss: 7.413746E+00 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 917/ 128728 | consumed samples: 14672 | consumed tokens: 30048256 | elapsed time per iteration (s): 15.22 | learning rate: 4.808E-06 | global batch size: 16 | lm loss: 7.521213E+00 | grad norm: 1.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 918/ 128728 | consumed samples: 14688 | consumed tokens: 30081024 | elapsed time per iteration (s): 15.26 | learning rate: 4.813E-06 | global batch size: 16 | lm loss: 7.561061E+00 | grad norm: 1.997 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 919/ 128728 | consumed samples: 14704 | consumed tokens: 30113792 | elapsed time per iteration (s): 15.22 | learning rate: 4.818E-06 | global batch size: 16 | lm loss: 7.206147E+00 | grad norm: 1.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 920/ 128728 | consumed samples: 14720 | consumed tokens: 30146560 | elapsed time per iteration (s): 15.23 | learning rate: 4.823E-06 | global batch size: 16 | lm loss: 7.452460E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 921/ 128728 | consumed samples: 14736 | consumed tokens: 30179328 | elapsed time per iteration (s): 15.27 | learning rate: 4.829E-06 | global batch size: 16 | lm loss: 7.200177E+00 | grad norm: 1.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 922/ 128728 | consumed samples: 14752 | consumed tokens: 30212096 | elapsed time per iteration (s): 15.24 | learning rate: 4.834E-06 | global batch size: 16 | lm loss: 7.294092E+00 | grad norm: 1.499 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 923/ 128728 | consumed samples: 14768 | consumed tokens: 30244864 | elapsed time per iteration (s): 15.28 | learning rate: 4.839E-06 | global batch size: 16 | lm loss: 7.780398E+00 | grad norm: 1.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 924/ 128728 | consumed samples: 14784 | consumed tokens: 30277632 | elapsed time per iteration (s): 15.24 | learning rate: 
4.844E-06 | global batch size: 16 | lm loss: 7.456621E+00 | grad norm: 2.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 925/ 128728 | consumed samples: 14800 | consumed tokens: 30310400 | elapsed time per iteration (s): 15.25 | learning rate: 4.850E-06 | global batch size: 16 | lm loss: 7.646140E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 926/ 128728 | consumed samples: 14816 | consumed tokens: 30343168 | elapsed time per iteration (s): 15.24 | learning rate: 4.855E-06 | global batch size: 16 | lm loss: 7.587268E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 927/ 128728 | consumed samples: 14832 | consumed tokens: 30375936 | elapsed time per iteration (s): 15.27 | learning rate: 4.860E-06 | global batch size: 16 | lm loss: 7.366327E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 928/ 128728 | consumed samples: 14848 | consumed tokens: 30408704 | elapsed time per iteration (s): 15.20 | learning rate: 4.865E-06 | global batch size: 16 | lm loss: 7.437315E+00 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 929/ 128728 | consumed samples: 14864 | consumed tokens: 30441472 | elapsed time per iteration (s): 15.18 | learning rate: 4.871E-06 | global batch size: 16 | lm loss: 7.467528E+00 | grad norm: 1.623 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 930/ 128728 | consumed samples: 
14880 | consumed tokens: 30474240 | elapsed time per iteration (s): 15.21 | learning rate: 4.876E-06 | global batch size: 16 | lm loss: 7.356944E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 931/ 128728 | consumed samples: 14896 | consumed tokens: 30507008 | elapsed time per iteration (s): 15.24 | learning rate: 4.881E-06 | global batch size: 16 | lm loss: 7.382359E+00 | grad norm: 1.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 932/ 128728 | consumed samples: 14912 | consumed tokens: 30539776 | elapsed time per iteration (s): 15.24 | learning rate: 4.886E-06 | global batch size: 16 | lm loss: 7.406995E+00 | grad norm: 1.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 933/ 128728 | consumed samples: 14928 | consumed tokens: 30572544 | elapsed time per iteration (s): 15.20 | learning rate: 4.892E-06 | global batch size: 16 | lm loss: 7.376684E+00 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 934/ 128728 | consumed samples: 14944 | consumed tokens: 30605312 | elapsed time per iteration (s): 15.24 | learning rate: 4.897E-06 | global batch size: 16 | lm loss: 7.531736E+00 | grad norm: 1.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 935/ 128728 | consumed samples: 14960 | consumed tokens: 30638080 | elapsed time per iteration (s): 15.23 | learning rate: 4.902E-06 | global batch size: 16 | lm loss: 7.509977E+00 | grad norm: 1.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 936/ 128728 | consumed samples: 14976 | consumed tokens: 30670848 | elapsed time per iteration (s): 15.24 | learning rate: 4.907E-06 | global batch size: 16 | lm loss: 7.370396E+00 | grad norm: 1.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 937/ 128728 | consumed samples: 14992 | consumed tokens: 30703616 | elapsed time per iteration (s): 15.16 | learning rate: 4.913E-06 | global batch size: 16 | lm loss: 7.500789E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 938/ 128728 | consumed samples: 15008 | consumed tokens: 30736384 | elapsed time per iteration (s): 15.21 | learning rate: 4.918E-06 | global batch size: 16 | lm loss: 7.531604E+00 | grad norm: 1.082 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 939/ 128728 | consumed samples: 15024 | consumed tokens: 30769152 | elapsed time per iteration (s): 15.25 | learning rate: 4.923E-06 | global batch size: 16 | lm loss: 7.307188E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 940/ 128728 | consumed samples: 15040 | consumed tokens: 30801920 | elapsed time per iteration (s): 15.18 | learning rate: 4.928E-06 | global batch size: 16 | lm loss: 7.548573E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 941/ 128728 | consumed samples: 15056 | consumed tokens: 30834688 | elapsed time per iteration (s): 15.15 | learning rate: 4.934E-06 | global batch size: 16 | lm loss: 7.376065E+00 | grad norm: 1.133 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 942/ 128728 | consumed samples: 15072 | consumed tokens: 30867456 | elapsed time per iteration (s): 15.26 | learning rate: 4.939E-06 | global batch size: 16 | lm loss: 7.403994E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 943/ 128728 | consumed samples: 15088 | consumed tokens: 30900224 | elapsed time per iteration (s): 15.19 | learning rate: 4.944E-06 | global batch size: 16 | lm loss: 7.430916E+00 | grad norm: 1.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 944/ 128728 | consumed samples: 15104 | consumed tokens: 30932992 | elapsed time per iteration (s): 15.23 | learning rate: 4.949E-06 | global batch size: 16 | lm loss: 7.367596E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 945/ 128728 | consumed samples: 15120 | consumed tokens: 30965760 | elapsed time per iteration (s): 15.22 | learning rate: 4.955E-06 | global batch size: 16 | lm loss: 7.401025E+00 | grad norm: 1.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 946/ 128728 | consumed samples: 15136 | consumed tokens: 30998528 | elapsed time per iteration (s): 15.22 | learning rate: 4.960E-06 | global batch size: 16 | lm loss: 7.536839E+00 | grad norm: 1.355 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 947/ 128728 | consumed samples: 15152 | consumed tokens: 31031296 | elapsed time per iteration (s): 15.25 | 
learning rate: 4.965E-06 | global batch size: 16 | lm loss: 7.108221E+00 | grad norm: 1.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 948/ 128728 | consumed samples: 15168 | consumed tokens: 31064064 | elapsed time per iteration (s): 15.23 | learning rate: 4.970E-06 | global batch size: 16 | lm loss: 7.302841E+00 | grad norm: 1.602 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 949/ 128728 | consumed samples: 15184 | consumed tokens: 31096832 | elapsed time per iteration (s): 15.27 | learning rate: 4.976E-06 | global batch size: 16 | lm loss: 7.204376E+00 | grad norm: 1.067 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 950/ 128728 | consumed samples: 15200 | consumed tokens: 31129600 | elapsed time per iteration (s): 15.22 | learning rate: 4.981E-06 | global batch size: 16 | lm loss: 7.323405E+00 | grad norm: 1.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 951/ 128728 | consumed samples: 15216 | consumed tokens: 31162368 | elapsed time per iteration (s): 15.27 | learning rate: 4.986E-06 | global batch size: 16 | lm loss: 7.413459E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 952/ 128728 | consumed samples: 15232 | consumed tokens: 31195136 | elapsed time per iteration (s): 15.21 | learning rate: 4.991E-06 | global batch size: 16 | lm loss: 7.621178E+00 | grad norm: 1.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 953/ 128728 | 
consumed samples: 15248 | consumed tokens: 31227904 | elapsed time per iteration (s): 15.20 | learning rate: 4.996E-06 | global batch size: 16 | lm loss: 7.608077E+00 | grad norm: 1.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 954/ 128728 | consumed samples: 15264 | consumed tokens: 31260672 | elapsed time per iteration (s): 15.26 | learning rate: 5.002E-06 | global batch size: 16 | lm loss: 7.327553E+00 | grad norm: 1.315 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 955/ 128728 | consumed samples: 15280 | consumed tokens: 31293440 | elapsed time per iteration (s): 15.17 | learning rate: 5.007E-06 | global batch size: 16 | lm loss: 7.498928E+00 | grad norm: 1.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 956/ 128728 | consumed samples: 15296 | consumed tokens: 31326208 | elapsed time per iteration (s): 15.25 | learning rate: 5.012E-06 | global batch size: 16 | lm loss: 7.481583E+00 | grad norm: 1.408 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 957/ 128728 | consumed samples: 15312 | consumed tokens: 31358976 | elapsed time per iteration (s): 15.19 | learning rate: 5.017E-06 | global batch size: 16 | lm loss: 7.372598E+00 | grad norm: 1.637 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 958/ 128728 | consumed samples: 15328 | consumed tokens: 31391744 | elapsed time per iteration (s): 15.18 | learning rate: 5.023E-06 | global batch size: 16 | lm loss: 7.266788E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 959/ 128728 | consumed samples: 15344 | consumed tokens: 31424512 | elapsed time per iteration (s): 15.21 | learning rate: 5.028E-06 | global batch size: 16 | lm loss: 7.610543E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 960/ 128728 | consumed samples: 15360 | consumed tokens: 31457280 | elapsed time per iteration (s): 15.19 | learning rate: 5.033E-06 | global batch size: 16 | lm loss: 7.411926E+00 | grad norm: 1.393 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 961/ 128728 | consumed samples: 15376 | consumed tokens: 31490048 | elapsed time per iteration (s): 15.17 | learning rate: 5.038E-06 | global batch size: 16 | lm loss: 7.298542E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 962/ 128728 | consumed samples: 15392 | consumed tokens: 31522816 | elapsed time per iteration (s): 15.24 | learning rate: 5.044E-06 | global batch size: 16 | lm loss: 7.530574E+00 | grad norm: 1.634 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 963/ 128728 | consumed samples: 15408 | consumed tokens: 31555584 | elapsed time per iteration (s): 15.25 | learning rate: 5.049E-06 | global batch size: 16 | lm loss: 7.191813E+00 | grad norm: 1.394 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 964/ 128728 | consumed samples: 15424 | consumed tokens: 31588352 | elapsed time per iteration (s): 15.29 | learning rate: 5.054E-06 | global batch size: 16 | lm loss: 
7.466516E+00 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 965/ 128728 | consumed samples: 15440 | consumed tokens: 31621120 | elapsed time per iteration (s): 15.23 | learning rate: 5.059E-06 | global batch size: 16 | lm loss: 7.481571E+00 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 966/ 128728 | consumed samples: 15456 | consumed tokens: 31653888 | elapsed time per iteration (s): 15.25 | learning rate: 5.065E-06 | global batch size: 16 | lm loss: 7.445633E+00 | grad norm: 1.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 967/ 128728 | consumed samples: 15472 | consumed tokens: 31686656 | elapsed time per iteration (s): 15.26 | learning rate: 5.070E-06 | global batch size: 16 | lm loss: 7.634816E+00 | grad norm: 1.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 968/ 128728 | consumed samples: 15488 | consumed tokens: 31719424 | elapsed time per iteration (s): 15.26 | learning rate: 5.075E-06 | global batch size: 16 | lm loss: 7.474030E+00 | grad norm: 1.973 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 969/ 128728 | consumed samples: 15504 | consumed tokens: 31752192 | elapsed time per iteration (s): 15.25 | learning rate: 5.080E-06 | global batch size: 16 | lm loss: 7.217330E+00 | grad norm: 1.419 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 970/ 128728 | consumed samples: 15520 | consumed tokens: 31784960 | elapsed 
time per iteration (s): 15.25 | learning rate: 5.086E-06 | global batch size: 16 | lm loss: 7.412174E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 971/ 128728 | consumed samples: 15536 | consumed tokens: 31817728 | elapsed time per iteration (s): 15.19 | learning rate: 5.091E-06 | global batch size: 16 | lm loss: 7.506372E+00 | grad norm: 1.399 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 972/ 128728 | consumed samples: 15552 | consumed tokens: 31850496 | elapsed time per iteration (s): 15.17 | learning rate: 5.096E-06 | global batch size: 16 | lm loss: 7.401738E+00 | grad norm: 1.298 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 973/ 128728 | consumed samples: 15568 | consumed tokens: 31883264 | elapsed time per iteration (s): 15.16 | learning rate: 5.101E-06 | global batch size: 16 | lm loss: 7.248646E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 974/ 128728 | consumed samples: 15584 | consumed tokens: 31916032 | elapsed time per iteration (s): 15.25 | learning rate: 5.107E-06 | global batch size: 16 | lm loss: 7.523051E+00 | grad norm: 1.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 975/ 128728 | consumed samples: 15600 | consumed tokens: 31948800 | elapsed time per iteration (s): 15.21 | learning rate: 5.112E-06 | global batch size: 16 | lm loss: 7.623046E+00 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | 
[default7]: iteration 976/ 128728 | consumed samples: 15616 | consumed tokens: 31981568 | elapsed time per iteration (s): 15.19 | learning rate: 5.117E-06 | global batch size: 16 | lm loss: 7.583755E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 977/ 128728 | consumed samples: 15632 | consumed tokens: 32014336 | elapsed time per iteration (s): 15.26 | learning rate: 5.122E-06 | global batch size: 16 | lm loss: 7.316653E+00 | grad norm: 1.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 978/ 128728 | consumed samples: 15648 | consumed tokens: 32047104 | elapsed time per iteration (s): 15.26 | learning rate: 5.128E-06 | global batch size: 16 | lm loss: 7.298987E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 979/ 128728 | consumed samples: 15664 | consumed tokens: 32079872 | elapsed time per iteration (s): 15.25 | learning rate: 5.133E-06 | global batch size: 16 | lm loss: 7.467144E+00 | grad norm: 1.544 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 980/ 128728 | consumed samples: 15680 | consumed tokens: 32112640 | elapsed time per iteration (s): 15.22 | learning rate: 5.138E-06 | global batch size: 16 | lm loss: 7.399050E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 981/ 128728 | consumed samples: 15696 | consumed tokens: 32145408 | elapsed time per iteration (s): 15.23 | learning rate: 5.143E-06 | global batch size: 16 | lm loss: 7.307127E+00 | grad norm: 1.364 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 982/ 128728 | consumed samples: 15712 | consumed tokens: 32178176 | elapsed time per iteration (s): 15.21 | learning rate: 5.149E-06 | global batch size: 16 | lm loss: 7.372665E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 983/ 128728 | consumed samples: 15728 | consumed tokens: 32210944 | elapsed time per iteration (s): 15.17 | learning rate: 5.154E-06 | global batch size: 16 | lm loss: 7.395346E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 984/ 128728 | consumed samples: 15744 | consumed tokens: 32243712 | elapsed time per iteration (s): 15.26 | learning rate: 5.159E-06 | global batch size: 16 | lm loss: 7.418610E+00 | grad norm: 1.037 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 985/ 128728 | consumed samples: 15760 | consumed tokens: 32276480 | elapsed time per iteration (s): 15.22 | learning rate: 5.164E-06 | global batch size: 16 | lm loss: 7.631675E+00 | grad norm: 1.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 986/ 128728 | consumed samples: 15776 | consumed tokens: 32309248 | elapsed time per iteration (s): 15.24 | learning rate: 5.169E-06 | global batch size: 16 | lm loss: 7.382019E+00 | grad norm: 1.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 987/ 128728 | consumed samples: 15792 | consumed tokens: 32342016 | elapsed time per iteration (s): 15.25 | learning rate: 5.175E-06 | global batch 
size: 16 | lm loss: 7.357999E+00 | grad norm: 1.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 988/ 128728 | consumed samples: 15808 | consumed tokens: 32374784 | elapsed time per iteration (s): 15.18 | learning rate: 5.180E-06 | global batch size: 16 | lm loss: 7.538756E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 989/ 128728 | consumed samples: 15824 | consumed tokens: 32407552 | elapsed time per iteration (s): 15.23 | learning rate: 5.185E-06 | global batch size: 16 | lm loss: 7.230034E+00 | grad norm: 1.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 990/ 128728 | consumed samples: 15840 | consumed tokens: 32440320 | elapsed time per iteration (s): 15.23 | learning rate: 5.190E-06 | global batch size: 16 | lm loss: 7.380984E+00 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 991/ 128728 | consumed samples: 15856 | consumed tokens: 32473088 | elapsed time per iteration (s): 15.23 | learning rate: 5.196E-06 | global batch size: 16 | lm loss: 7.412922E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 992/ 128728 | consumed samples: 15872 | consumed tokens: 32505856 | elapsed time per iteration (s): 15.23 | learning rate: 5.201E-06 | global batch size: 16 | lm loss: 7.293040E+00 | grad norm: 1.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 993/ 128728 | consumed samples: 15888 | consumed tokens: 
32538624 | elapsed time per iteration (s): 15.14 | learning rate: 5.206E-06 | global batch size: 16 | lm loss: 7.172251E+00 | grad norm: 1.524 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 994/ 128728 | consumed samples: 15904 | consumed tokens: 32571392 | elapsed time per iteration (s): 15.27 | learning rate: 5.211E-06 | global batch size: 16 | lm loss: 7.383713E+00 | grad norm: 1.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 995/ 128728 | consumed samples: 15920 | consumed tokens: 32604160 | elapsed time per iteration (s): 15.22 | learning rate: 5.217E-06 | global batch size: 16 | lm loss: 7.343609E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 996/ 128728 | consumed samples: 15936 | consumed tokens: 32636928 | elapsed time per iteration (s): 15.21 | learning rate: 5.222E-06 | global batch size: 16 | lm loss: 7.478510E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 997/ 128728 | consumed samples: 15952 | consumed tokens: 32669696 | elapsed time per iteration (s): 15.28 | learning rate: 5.227E-06 | global batch size: 16 | lm loss: 7.494905E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 998/ 128728 | consumed samples: 15968 | consumed tokens: 32702464 | elapsed time per iteration (s): 15.22 | learning rate: 5.232E-06 | global batch size: 16 | lm loss: 7.248654E+00 | grad norm: 1.117 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | 
TFLOPs: 8.05 | [default7]: iteration 999/ 128728 | consumed samples: 15984 | consumed tokens: 32735232 | elapsed time per iteration (s): 15.25 | learning rate: 5.238E-06 | global batch size: 16 | lm loss: 7.334100E+00 | grad norm: 1.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1000/ 128728 | consumed samples: 16000 | consumed tokens: 32768000 | elapsed time per iteration (s): 15.23 | learning rate: 5.243E-06 | global batch size: 16 | lm loss: 7.241666E+00 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]:------------------------------------------------------------------------------------------ [default7]:valid loss at iteration 1000 | lm loss value: 7.702314E+00 | lm loss PPL: 2.213464E+03 | [default7]:------------------------------------------------------------------------------------------ [default0]:saving checkpoint at iteration 1000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 10:08:59,597] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/mp_rank_00_model_states.pt [default1]:[2022-03-03 10:08:59,711] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/mp_rank_01_model_states.pt [default4]:[2022-03-03 10:09:11,547] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default3]:[2022-03-03 10:09:11,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default0]:[2022-03-03 10:09:11,926] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default7]:[2022-03-03 10:09:11,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default1]:[2022-03-03 10:09:12,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default4]:[2022-03-03 10:09:12,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default5]:[2022-03-03 10:09:12,319] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default6]:[2022-03-03 10:09:12,289] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default2]:[2022-03-03 10:09:12,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default5]:[2022-03-03 10:09:12,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default3]:[2022-03-03 10:09:12,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default5]:[2022-03-03 10:09:12,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default6]:[2022-03-03 10:09:12,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default2]:[2022-03-03 10:09:12,815] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default2]:[2022-03-03 10:09:12,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default1]:[2022-03-03 10:09:12,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default3]:[2022-03-03 10:09:12,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default7]:[2022-03-03 10:09:12,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default0]:[2022-03-03 10:09:13,012] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default4]:[2022-03-03 10:09:12,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default6]:[2022-03-03 10:09:13,085] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default7]:[2022-03-03 10:09:13,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default2]:[2022-03-03 10:09:13,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default0]:[2022-03-03 10:09:13,332] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default6]:[2022-03-03 10:09:13,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default3]:[2022-03-03 10:09:13,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default6]:[2022-03-03 10:09:13,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default0]:[2022-03-03 10:09:13,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default4]:[2022-03-03 10:09:13,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default1]:[2022-03-03 10:09:13,610] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default0]:[2022-03-03 10:09:13,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default7]:[2022-03-03 10:09:13,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default5]:[2022-03-03 10:09:13,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default1]:[2022-03-03 10:09:13,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default5]:[2022-03-03 10:09:13,637] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default3]:[2022-03-03 10:09:13,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default7]:[2022-03-03 10:09:13,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default1]:[2022-03-03 10:09:13,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default4]:[2022-03-03 10:09:13,916] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default0]:[2022-03-03 10:09:13,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default2]:[2022-03-03 10:09:14,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default5]:[2022-03-03 10:09:14,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default3]:[2022-03-03 10:09:14,078] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default5]:[2022-03-03 10:09:14,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default4]:[2022-03-03 10:09:14,242] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default1]:[2022-03-03 10:09:14,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default0]:[2022-03-03 10:09:14,591] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default1]:[2022-03-03 10:09:14,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default4]:[2022-03-03 10:09:14,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default6]:[2022-03-03 10:09:14,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default2]:[2022-03-03 10:09:14,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default3]:[2022-03-03 10:09:14,879] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default4]:[2022-03-03 10:09:14,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default0]:[2022-03-03 10:09:14,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default7]:[2022-03-03 10:09:15,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default1]:[2022-03-03 10:09:15,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default5]:[2022-03-03 10:09:15,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default2]:[2022-03-03 10:09:15,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default7]:[2022-03-03 10:09:15,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default5]:[2022-03-03 10:09:15,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default2]:[2022-03-03 10:09:15,367] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default5]:[2022-03-03 10:09:15,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default3]:[2022-03-03 10:09:15,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default2]:[2022-03-03 10:09:15,601] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default3]:[2022-03-03 10:09:15,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default1]:[2022-03-03 10:09:15,629] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default1]:[2022-03-03 10:09:15,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default2]:[2022-03-03 10:09:15,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default4]:[2022-03-03 10:09:15,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default0]:[2022-03-03 10:09:15,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default5]:[2022-03-03 10:09:15,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default1]:[2022-03-03 10:09:15,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default0]:[2022-03-03 10:09:15,728] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default6]:[2022-03-03 10:09:15,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default6]:[2022-03-03 10:09:15,976] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default4]:[2022-03-03 10:09:16,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default3]:[2022-03-03 10:09:16,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default7]:[2022-03-03 10:09:16,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default6]:[2022-03-03 10:09:16,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default7]:[2022-03-03 10:09:16,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default5]:[2022-03-03 10:09:16,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default2]:[2022-03-03 10:09:16,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default6]:[2022-03-03 10:09:16,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default7]:[2022-03-03 10:09:16,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default3]:[2022-03-03 10:09:16,292] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default6]:[2022-03-03 10:09:16,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default3]:[2022-03-03 10:09:16,405] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default4]:[2022-03-03 10:09:16,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default4]:[2022-03-03 10:09:16,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default6]:[2022-03-03 10:09:16,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default5]:[2022-03-03 10:09:16,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default6]:[2022-03-03 10:09:16,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default7]:[2022-03-03 10:09:16,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default2]:[2022-03-03 10:09:16,594] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default7]:[2022-03-03 10:09:16,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default5]:[2022-03-03 10:09:16,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default3]:[2022-03-03 10:09:16,743] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default6]:[2022-03-03 10:09:16,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default2]:[2022-03-03 10:09:16,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default2]:[2022-03-03 10:09:17,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default7]:[2022-03-03 10:09:17,102] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default0]:[2022-03-03 10:09:17,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default4]:[2022-03-03 10:09:17,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default6]:[2022-03-03 10:09:17,141] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default3]:[2022-03-03 10:09:17,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default0]:[2022-03-03 10:09:17,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default3]:[2022-03-03 10:09:17,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default7]:[2022-03-03 10:09:17,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default2]:[2022-03-03 10:09:17,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default1]:[2022-03-03 10:09:17,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default3]:[2022-03-03 10:09:17,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default5]:[2022-03-03 10:09:17,448] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default3]:[2022-03-03 10:09:17,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default1]:[2022-03-03 10:09:17,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default0]:[2022-03-03 10:09:17,555] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default0]:[2022-03-03 10:09:17,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default7]:[2022-03-03 10:09:17,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default7]:[2022-03-03 10:09:17,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default2]:[2022-03-03 10:09:17,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default4]:[2022-03-03 10:09:17,618] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default6]:[2022-03-03 10:09:17,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default4]:[2022-03-03 10:09:17,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default4]:[2022-03-03 10:09:17,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default3]:[2022-03-03 10:09:17,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default1]:[2022-03-03 10:09:17,828] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default6]:[2022-03-03 10:09:17,803] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default0]:[2022-03-03 10:09:17,796] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default2]:[2022-03-03 10:09:17,797] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default5]:[2022-03-03 10:09:17,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default6]:[2022-03-03 10:09:17,894] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default7]:[2022-03-03 10:09:17,913] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default4]:[2022-03-03 10:09:17,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default0]:[2022-03-03 10:09:17,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default1]:[2022-03-03 10:09:17,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default0]:[2022-03-03 10:09:17,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default1]:[2022-03-03 10:09:17,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default4]:[2022-03-03 10:09:18,046] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default4]:[2022-03-03 10:09:18,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default1]:[2022-03-03 10:09:18,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default0]:[2022-03-03 10:09:18,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default5]:[2022-03-03 10:09:18,280] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default3]:[2022-03-03 10:09:18,397] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default2]:[2022-03-03 10:09:18,337] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default0]:[2022-03-03 10:09:18,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default7]:[2022-03-03 10:09:18,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default6]:[2022-03-03 10:09:18,442] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default6]:[2022-03-03 10:09:18,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default3]:[2022-03-03 10:09:18,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default1]:[2022-03-03 10:09:18,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default7]:[2022-03-03 10:09:18,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default3]:[2022-03-03 10:09:18,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default0]:[2022-03-03 10:09:18,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default7]:[2022-03-03 10:09:18,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default6]:[2022-03-03 10:09:18,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default6]:[2022-03-03 10:09:18,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default2]:[2022-03-03 10:09:18,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default6]:[2022-03-03 10:09:18,505] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default1]:[2022-03-03 10:09:18,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default4]:[2022-03-03 10:09:18,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default7]:[2022-03-03 10:09:18,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default3]:[2022-03-03 10:09:18,676] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default5]:[2022-03-03 10:09:18,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default0]:[2022-03-03 10:09:18,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default2]:[2022-03-03 10:09:18,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default2]:[2022-03-03 10:09:18,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default3]:[2022-03-03 10:09:18,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default3]:[2022-03-03 10:09:18,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default5]:[2022-03-03 10:09:18,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default4]:[2022-03-03 10:09:18,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default5]:[2022-03-03 10:09:18,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default0]:[2022-03-03 10:09:18,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default0]:[2022-03-03 10:09:18,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default3]:[2022-03-03 10:09:18,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default2]:[2022-03-03 10:09:18,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default7]:[2022-03-03 10:09:19,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default1]:[2022-03-03 10:09:18,996] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default5]:[2022-03-03 10:09:18,998] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default6]:[2022-03-03 10:09:19,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default2]:[2022-03-03 10:09:19,086] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default6]:[2022-03-03 10:09:19,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default1]:[2022-03-03 10:09:19,098] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default0]:[2022-03-03 10:09:19,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default2]:[2022-03-03 10:09:19,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default5]:[2022-03-03 10:09:19,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default4]:[2022-03-03 10:09:19,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default7]:[2022-03-03 10:09:19,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default5]:[2022-03-03 10:09:19,231] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default2]:[2022-03-03 10:09:19,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default7]:[2022-03-03 10:09:19,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default1]:[2022-03-03 10:09:19,340] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default3]:[2022-03-03 10:09:19,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default1]:[2022-03-03 10:09:19,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default6]:[2022-03-03 10:09:19,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default1]:[2022-03-03 10:09:19,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default2]:[2022-03-03 10:09:19,470] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default4]:[2022-03-03 10:09:19,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default1]:[2022-03-03 10:09:19,421] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default2]:[2022-03-03 10:09:19,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default0]:[2022-03-03 10:09:19,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default6]:[2022-03-03 10:09:19,548] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default7]:[2022-03-03 10:09:19,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default5]:[2022-03-03 10:09:19,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default3]:[2022-03-03 10:09:19,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default3]:[2022-03-03 10:09:19,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default0]:[2022-03-03 10:09:19,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default1]:[2022-03-03 10:09:19,682] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default0]:[2022-03-03 10:09:19,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default6]:[2022-03-03 10:09:19,680] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default0]:[2022-03-03 10:09:19,754] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default0]:[2022-03-03 10:09:19,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default1]:[2022-03-03 10:09:19,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default2]:[2022-03-03 10:09:19,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default7]:[2022-03-03 10:09:19,743] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default1]:[2022-03-03 10:09:19,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default4]:[2022-03-03 10:09:19,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default0]:[2022-03-03 10:09:19,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default0]:[2022-03-03 10:09:20,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default7]:[2022-03-03 10:09:20,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default1]:[2022-03-03 10:09:20,101] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default1]:[2022-03-03 10:09:20,117] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default3]:[2022-03-03 10:09:20,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default4]:[2022-03-03 10:09:20,152] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default1]:[2022-03-03 10:09:20,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default7]:[2022-03-03 10:09:20,240] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default5]:[2022-03-03 10:09:20,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default1]:[2022-03-03 10:09:20,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default7]:[2022-03-03 10:09:20,355] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default6]:[2022-03-03 10:09:20,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default5]:[2022-03-03 10:09:20,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default0]:[2022-03-03 10:09:20,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default7]:[2022-03-03 10:09:20,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default5]:[2022-03-03 10:09:20,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default4]:[2022-03-03 10:09:20,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default3]:[2022-03-03 10:09:20,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default4]:[2022-03-03 10:09:20,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default2]:[2022-03-03 10:09:20,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default6]:[2022-03-03 10:09:20,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default2]:[2022-03-03 10:09:20,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default4]:[2022-03-03 10:09:20,818] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default6]:[2022-03-03 10:09:20,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default0]:[2022-03-03 10:09:20,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default2]:[2022-03-03 10:09:20,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default7]:[2022-03-03 10:09:20,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default6]:[2022-03-03 10:09:20,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default2]:[2022-03-03 10:09:21,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default4]:[2022-03-03 10:09:21,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default7]:[2022-03-03 10:09:21,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default5]:[2022-03-03 10:09:21,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default4]:[2022-03-03 10:09:21,214] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default1]:[2022-03-03 10:09:21,212] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default7]:[2022-03-03 10:09:21,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default0]:[2022-03-03 10:09:21,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default3]:[2022-03-03 10:09:21,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default5]:[2022-03-03 10:09:21,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default7]:[2022-03-03 10:09:21,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default2]:[2022-03-03 10:09:21,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default4]:[2022-03-03 10:09:21,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default6]:[2022-03-03 10:09:21,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default7]:[2022-03-03 10:09:21,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default3]:[2022-03-03 10:09:21,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default5]:[2022-03-03 10:09:21,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default1]:[2022-03-03 10:09:21,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default5]:[2022-03-03 10:09:21,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default6]:[2022-03-03 10:09:21,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default2]:[2022-03-03 10:09:21,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default5]:[2022-03-03 10:09:21,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default1]:[2022-03-03 10:09:21,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default2]:[2022-03-03 10:09:21,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default6]:[2022-03-03 10:09:21,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default2]:[2022-03-03 10:09:21,655] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default5]:[2022-03-03 10:09:21,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default4]:[2022-03-03 10:09:21,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default7]:[2022-03-03 10:09:21,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default4]:[2022-03-03 10:09:21,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default6]:[2022-03-03 10:09:21,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default3]:[2022-03-03 10:09:21,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default5]:[2022-03-03 10:09:21,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default4]:[2022-03-03 10:09:21,800] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default4]:[2022-03-03 10:09:21,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default0]:[2022-03-03 10:09:21,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default2]:[2022-03-03 10:09:21,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default7]:[2022-03-03 10:09:22,015] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default1]:[2022-03-03 10:09:22,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default0]:[2022-03-03 10:09:22,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default3]:[2022-03-03 10:09:21,988] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default2]:[2022-03-03 10:09:21,993] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default1]:[2022-03-03 10:09:22,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default6]:[2022-03-03 10:09:22,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default4]:[2022-03-03 10:09:22,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default5]:[2022-03-03 10:09:22,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default4]:[2022-03-03 10:09:22,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default5]:[2022-03-03 10:09:22,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default0]:[2022-03-03 10:09:22,238] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default5]:[2022-03-03 10:09:22,309] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default3]:[2022-03-03 10:09:22,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default3]:[2022-03-03 10:09:22,287] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default7]:[2022-03-03 10:09:22,260] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default3]:[2022-03-03 10:09:22,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default1]:[2022-03-03 10:09:22,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default6]:[2022-03-03 10:09:22,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default7]:[2022-03-03 10:09:22,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default0]:[2022-03-03 10:09:22,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default0]:[2022-03-03 10:09:22,523] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default6]:[2022-03-03 10:09:22,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default0]:[2022-03-03 10:09:22,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default7]:[2022-03-03 10:09:22,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default6]:[2022-03-03 10:09:22,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default2]:[2022-03-03 10:09:22,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default3]:[2022-03-03 10:09:22,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default2]:[2022-03-03 10:09:22,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default3]:[2022-03-03 10:09:22,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default2]:[2022-03-03 10:09:23,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default3]:[2022-03-03 10:09:23,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default1]:[2022-03-03 10:09:23,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default0]:[2022-03-03 10:09:23,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default1]:[2022-03-03 10:09:23,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default2]:[2022-03-03 10:09:23,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default0]:[2022-03-03 10:09:23,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default3]:[2022-03-03 10:09:23,247] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default2]:[2022-03-03 10:09:23,256] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default1]:[2022-03-03 10:09:23,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default5]:[2022-03-03 10:09:23,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default3]:[2022-03-03 10:09:23,569] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default2]:[2022-03-03 10:09:23,575] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default4]:[2022-03-03 10:09:23,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default3]:[2022-03-03 10:09:23,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default1]:[2022-03-03 10:09:23,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default4]:[2022-03-03 10:09:23,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default2]:[2022-03-03 10:09:23,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default0]:[2022-03-03 10:09:23,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default1]:[2022-03-03 10:09:23,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default2]:[2022-03-03 10:09:23,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default0]:[2022-03-03 10:09:23,746] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default5]:[2022-03-03 10:09:23,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default3]:[2022-03-03 10:09:23,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default6]:[2022-03-03 10:09:23,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default7]:[2022-03-03 10:09:23,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default5]:[2022-03-03 10:09:24,052] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default6]:[2022-03-03 10:09:24,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default5]:[2022-03-03 10:09:24,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default4]:[2022-03-03 10:09:24,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default1]:[2022-03-03 10:09:24,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default4]:[2022-03-03 10:09:24,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default3]:[2022-03-03 10:09:24,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default7]:[2022-03-03 10:09:24,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default6]:[2022-03-03 10:09:24,787] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default7]:[2022-03-03 10:09:24,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default5]:[2022-03-03 10:09:24,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default4]:[2022-03-03 10:09:24,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default2]:[2022-03-03 10:09:24,970] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default7]:[2022-03-03 10:09:25,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default0]:[2022-03-03 10:09:25,020] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default6]:[2022-03-03 10:09:25,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default7]:[2022-03-03 10:09:25,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default6]:[2022-03-03 10:09:25,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default2]:[2022-03-03 10:09:25,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default3]:[2022-03-03 10:09:25,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default3]:[2022-03-03 10:09:25,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default4]:[2022-03-03 10:09:25,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default5]:[2022-03-03 10:09:25,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default4]:[2022-03-03 10:09:25,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default4]:[2022-03-03 10:09:25,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default5]:[2022-03-03 10:09:25,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default5]:[2022-03-03 10:09:25,538] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default7]:[2022-03-03 10:09:25,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default3]:[2022-03-03 10:09:25,551] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default6]:[2022-03-03 10:09:25,640] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default1]:[2022-03-03 10:09:25,777] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default1]:[2022-03-03 10:09:26,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default0]:[2022-03-03 10:09:26,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default5]:[2022-03-03 10:09:26,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default4]:[2022-03-03 10:09:26,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default4]:[2022-03-03 10:09:26,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default7]:[2022-03-03 10:09:26,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default5]:[2022-03-03 10:09:26,497] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default6]:[2022-03-03 10:09:26,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default5]:[2022-03-03 10:09:26,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default1]:[2022-03-03 10:09:26,901] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default4]:[2022-03-03 10:09:26,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default0]:[2022-03-03 10:09:27,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default6]:[2022-03-03 10:09:27,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default7]:[2022-03-03 10:09:27,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default1]:[2022-03-03 10:09:28,279] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default0]:[2022-03-03 10:09:28,249] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default7]:time (ms) | save-checkpoint: 35998.51 [default0]: successfully saved checkpoint at iteration 1000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]: iteration 1001/ 128728 | consumed samples: 16016 | consumed tokens: 32800768 | elapsed time per iteration (s): 70.83 | learning rate: 5.248E-06 | global batch size: 16 | lm loss: 7.257627E+00 | grad norm: 1.325 | num zeros: 0.0 | number of
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.226 | TFLOPs: 1.73 | [default7]: iteration 1002/ 128728 | consumed samples: 16032 | consumed tokens: 32833536 | elapsed time per iteration (s): 15.28 | learning rate: 5.253E-06 | global batch size: 16 | lm loss: 7.265201E+00 | grad norm: 1.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1003/ 128728 | consumed samples: 16048 | consumed tokens: 32866304 | elapsed time per iteration (s): 15.25 | learning rate: 5.259E-06 | global batch size: 16 | lm loss: 7.525159E+00 | grad norm: 1.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1004/ 128728 | consumed samples: 16064 | consumed tokens: 32899072 | elapsed time per iteration (s): 15.25 | learning rate: 5.264E-06 | global batch size: 16 | lm loss: 7.367915E+00 | grad norm: 1.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1005/ 128728 | consumed samples: 16080 | consumed tokens: 32931840 | elapsed time per iteration (s): 15.26 | learning rate: 5.269E-06 | global batch size: 16 | lm loss: 7.435073E+00 | grad norm: 1.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1006/ 128728 | consumed samples: 16096 | consumed tokens: 32964608 | elapsed time per iteration (s): 15.24 | learning rate: 5.274E-06 | global batch size: 16 | lm loss: 7.265368E+00 | grad norm: 1.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1007/ 128728 | consumed samples: 16112 | consumed tokens: 32997376 | elapsed time per iteration (s): 15.24 | learning rate: 5.280E-06 | 
global batch size: 16 | lm loss: 7.300901E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1008/ 128728 | consumed samples: 16128 | consumed tokens: 33030144 | elapsed time per iteration (s): 15.24 | learning rate: 5.285E-06 | global batch size: 16 | lm loss: 7.472819E+00 | grad norm: 1.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1009/ 128728 | consumed samples: 16144 | consumed tokens: 33062912 | elapsed time per iteration (s): 15.25 | learning rate: 5.290E-06 | global batch size: 16 | lm loss: 7.227314E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1010/ 128728 | consumed samples: 16160 | consumed tokens: 33095680 | elapsed time per iteration (s): 15.22 | learning rate: 5.295E-06 | global batch size: 16 | lm loss: 7.344738E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1011/ 128728 | consumed samples: 16176 | consumed tokens: 33128448 | elapsed time per iteration (s): 15.26 | learning rate: 5.301E-06 | global batch size: 16 | lm loss: 7.324342E+00 | grad norm: 1.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1012/ 128728 | consumed samples: 16192 | consumed tokens: 33161216 | elapsed time per iteration (s): 15.24 | learning rate: 5.306E-06 | global batch size: 16 | lm loss: 7.071029E+00 | grad norm: 1.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1013/ 128728 | consumed samples: 16208 | 
consumed tokens: 33193984 | elapsed time per iteration (s): 15.25 | learning rate: 5.311E-06 | global batch size: 16 | lm loss: 7.107207E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1014/ 128728 | consumed samples: 16224 | consumed tokens: 33226752 | elapsed time per iteration (s): 15.26 | learning rate: 5.316E-06 | global batch size: 16 | lm loss: 7.222437E+00 | grad norm: 1.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1015/ 128728 | consumed samples: 16240 | consumed tokens: 33259520 | elapsed time per iteration (s): 15.25 | learning rate: 5.322E-06 | global batch size: 16 | lm loss: 7.451645E+00 | grad norm: 2.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1016/ 128728 | consumed samples: 16256 | consumed tokens: 33292288 | elapsed time per iteration (s): 15.19 | learning rate: 5.327E-06 | global batch size: 16 | lm loss: 7.183714E+00 | grad norm: 1.511 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1017/ 128728 | consumed samples: 16272 | consumed tokens: 33325056 | elapsed time per iteration (s): 15.26 | learning rate: 5.332E-06 | global batch size: 16 | lm loss: 7.206068E+00 | grad norm: 1.397 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1018/ 128728 | consumed samples: 16288 | consumed tokens: 33357824 | elapsed time per iteration (s): 15.25 | learning rate: 5.337E-06 | global batch size: 16 | lm loss: 7.339333E+00 | grad norm: 1.075 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1019/ 128728 | consumed samples: 16304 | consumed tokens: 33390592 | elapsed time per iteration (s): 15.24 | learning rate: 5.343E-06 | global batch size: 16 | lm loss: 7.346642E+00 | grad norm: 1.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1020/ 128728 | consumed samples: 16320 | consumed tokens: 33423360 | elapsed time per iteration (s): 15.26 | learning rate: 5.348E-06 | global batch size: 16 | lm loss: 7.557926E+00 | grad norm: 1.374 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1021/ 128728 | consumed samples: 16336 | consumed tokens: 33456128 | elapsed time per iteration (s): 15.20 | learning rate: 5.353E-06 | global batch size: 16 | lm loss: 7.477837E+00 | grad norm: 1.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1022/ 128728 | consumed samples: 16352 | consumed tokens: 33488896 | elapsed time per iteration (s): 15.21 | learning rate: 5.358E-06 | global batch size: 16 | lm loss: 7.073501E+00 | grad norm: 1.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1023/ 128728 | consumed samples: 16368 | consumed tokens: 33521664 | elapsed time per iteration (s): 15.15 | learning rate: 5.363E-06 | global batch size: 16 | lm loss: 7.267119E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 1024/ 128728 | consumed samples: 16384 | consumed tokens: 33554432 | elapsed time per iteration (s): 15.23 | learning rate: 5.369E-06 | global batch size: 16 | lm loss: 7.294874E+00 | grad norm: 1.535 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1025/ 128728 | consumed samples: 16400 | consumed tokens: 33587200 | elapsed time per iteration (s): 15.25 | learning rate: 5.374E-06 | global batch size: 16 | lm loss: 7.133692E+00 | grad norm: 1.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1026/ 128728 | consumed samples: 16416 | consumed tokens: 33619968 | elapsed time per iteration (s): 15.22 | learning rate: 5.379E-06 | global batch size: 16 | lm loss: 7.371020E+00 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1027/ 128728 | consumed samples: 16432 | consumed tokens: 33652736 | elapsed time per iteration (s): 15.24 | learning rate: 5.384E-06 | global batch size: 16 | lm loss: 7.288789E+00 | grad norm: 1.520 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1028/ 128728 | consumed samples: 16448 | consumed tokens: 33685504 | elapsed time per iteration (s): 15.30 | learning rate: 5.390E-06 | global batch size: 16 | lm loss: 7.304897E+00 | grad norm: 1.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1029/ 128728 | consumed samples: 16464 | consumed tokens: 33718272 | elapsed time per iteration (s): 15.26 | learning rate: 5.395E-06 | global batch size: 16 | lm loss: 7.384569E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1030/ 128728 | consumed samples: 16480 | consumed tokens: 33751040 | elapsed time per iteration (s): 15.26 
| learning rate: 5.400E-06 | global batch size: 16 | lm loss: 7.309175E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1031/ 128728 | consumed samples: 16496 | consumed tokens: 33783808 | elapsed time per iteration (s): 15.24 | learning rate: 5.405E-06 | global batch size: 16 | lm loss: 7.343480E+00 | grad norm: 1.259 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1032/ 128728 | consumed samples: 16512 | consumed tokens: 33816576 | elapsed time per iteration (s): 15.24 | learning rate: 5.411E-06 | global batch size: 16 | lm loss: 7.319173E+00 | grad norm: 1.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1033/ 128728 | consumed samples: 16528 | consumed tokens: 33849344 | elapsed time per iteration (s): 15.23 | learning rate: 5.416E-06 | global batch size: 16 | lm loss: 7.423133E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1034/ 128728 | consumed samples: 16544 | consumed tokens: 33882112 | elapsed time per iteration (s): 15.21 | learning rate: 5.421E-06 | global batch size: 16 | lm loss: 7.386244E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1035/ 128728 | consumed samples: 16560 | consumed tokens: 33914880 | elapsed time per iteration (s): 15.25 | learning rate: 5.426E-06 | global batch size: 16 | lm loss: 7.329965E+00 | grad norm: 1.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1036/ 128728 
| consumed samples: 16576 | consumed tokens: 33947648 | elapsed time per iteration (s): 15.26 | learning rate: 5.432E-06 | global batch size: 16 | lm loss: 7.282664E+00 | grad norm: 1.586 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1037/ 128728 | consumed samples: 16592 | consumed tokens: 33980416 | elapsed time per iteration (s): 15.25 | learning rate: 5.437E-06 | global batch size: 16 | lm loss: 7.157454E+00 | grad norm: 1.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1038/ 128728 | consumed samples: 16608 | consumed tokens: 34013184 | elapsed time per iteration (s): 15.23 | learning rate: 5.442E-06 | global batch size: 16 | lm loss: 7.269532E+00 | grad norm: 1.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1039/ 128728 | consumed samples: 16624 | consumed tokens: 34045952 | elapsed time per iteration (s): 15.26 | learning rate: 5.447E-06 | global batch size: 16 | lm loss: 7.390067E+00 | grad norm: 1.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1040/ 128728 | consumed samples: 16640 | consumed tokens: 34078720 | elapsed time per iteration (s): 15.20 | learning rate: 5.453E-06 | global batch size: 16 | lm loss: 7.319128E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1041/ 128728 | consumed samples: 16656 | consumed tokens: 34111488 | elapsed time per iteration (s): 15.25 | learning rate: 5.458E-06 | global batch size: 16 | lm loss: 7.343173E+00 | grad norm: 1.067 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1042/ 128728 | consumed samples: 16672 | consumed tokens: 34144256 | elapsed time per iteration (s): 15.23 | learning rate: 5.463E-06 | global batch size: 16 | lm loss: 7.418891E+00 | grad norm: 1.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1043/ 128728 | consumed samples: 16688 | consumed tokens: 34177024 | elapsed time per iteration (s): 15.23 | learning rate: 5.468E-06 | global batch size: 16 | lm loss: 7.088163E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1044/ 128728 | consumed samples: 16704 | consumed tokens: 34209792 | elapsed time per iteration (s): 15.28 | learning rate: 5.474E-06 | global batch size: 16 | lm loss: 7.283275E+00 | grad norm: 1.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1045/ 128728 | consumed samples: 16720 | consumed tokens: 34242560 | elapsed time per iteration (s): 15.23 | learning rate: 5.479E-06 | global batch size: 16 | lm loss: 7.177429E+00 | grad norm: 1.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1046/ 128728 | consumed samples: 16736 | consumed tokens: 34275328 | elapsed time per iteration (s): 15.24 | learning rate: 5.484E-06 | global batch size: 16 | lm loss: 7.403968E+00 | grad norm: 1.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1047/ 128728 | consumed samples: 16752 | consumed tokens: 34308096 | elapsed time per iteration (s): 15.25 | learning rate: 5.489E-06 | global batch size: 16 | lm loss: 
7.409142E+00 | grad norm: 1.580 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1048/ 128728 | consumed samples: 16768 | consumed tokens: 34340864 | elapsed time per iteration (s): 15.21 | learning rate: 5.495E-06 | global batch size: 16 | lm loss: 7.269386E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1049/ 128728 | consumed samples: 16784 | consumed tokens: 34373632 | elapsed time per iteration (s): 15.17 | learning rate: 5.500E-06 | global batch size: 16 | lm loss: 7.443803E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1050/ 128728 | consumed samples: 16800 | consumed tokens: 34406400 | elapsed time per iteration (s): 15.25 | learning rate: 5.505E-06 | global batch size: 16 | lm loss: 7.035776E+00 | grad norm: 1.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1051/ 128728 | consumed samples: 16816 | consumed tokens: 34439168 | elapsed time per iteration (s): 15.28 | learning rate: 5.510E-06 | global batch size: 16 | lm loss: 7.198908E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1052/ 128728 | consumed samples: 16832 | consumed tokens: 34471936 | elapsed time per iteration (s): 15.24 | learning rate: 5.516E-06 | global batch size: 16 | lm loss: 7.287247E+00 | grad norm: 1.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1053/ 128728 | consumed samples: 16848 | consumed tokens: 34504704 | 
elapsed time per iteration (s): 15.22 | learning rate: 5.521E-06 | global batch size: 16 | lm loss: 7.180941E+00 | grad norm: 0.943 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1054/ 128728 | consumed samples: 16864 | consumed tokens: 34537472 | elapsed time per iteration (s): 15.25 | learning rate: 5.526E-06 | global batch size: 16 | lm loss: 7.035480E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1055/ 128728 | consumed samples: 16880 | consumed tokens: 34570240 | elapsed time per iteration (s): 15.25 | learning rate: 5.531E-06 | global batch size: 16 | lm loss: 7.411442E+00 | grad norm: 1.504 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1056/ 128728 | consumed samples: 16896 | consumed tokens: 34603008 | elapsed time per iteration (s): 15.24 | learning rate: 5.536E-06 | global batch size: 16 | lm loss: 7.284391E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1057/ 128728 | consumed samples: 16912 | consumed tokens: 34635776 | elapsed time per iteration (s): 15.24 | learning rate: 5.542E-06 | global batch size: 16 | lm loss: 7.234114E+00 | grad norm: 1.008 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1058/ 128728 | consumed samples: 16928 | consumed tokens: 34668544 | elapsed time per iteration (s): 15.29 | learning rate: 5.547E-06 | global batch size: 16 | lm loss: 7.331013E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 
8.01 | [default7]: iteration 1059/ 128728 | consumed samples: 16944 | consumed tokens: 34701312 | elapsed time per iteration (s): 15.23 | learning rate: 5.552E-06 | global batch size: 16 | lm loss: 7.221325E+00 | grad norm: 1.413 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1060/ 128728 | consumed samples: 16960 | consumed tokens: 34734080 | elapsed time per iteration (s): 15.23 | learning rate: 5.557E-06 | global batch size: 16 | lm loss: 7.175035E+00 | grad norm: 1.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1061/ 128728 | consumed samples: 16976 | consumed tokens: 34766848 | elapsed time per iteration (s): 15.26 | learning rate: 5.563E-06 | global batch size: 16 | lm loss: 7.444801E+00 | grad norm: 1.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1062/ 128728 | consumed samples: 16992 | consumed tokens: 34799616 | elapsed time per iteration (s): 15.26 | learning rate: 5.568E-06 | global batch size: 16 | lm loss: 7.480289E+00 | grad norm: 1.083 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1063/ 128728 | consumed samples: 17008 | consumed tokens: 34832384 | elapsed time per iteration (s): 15.23 | learning rate: 5.573E-06 | global batch size: 16 | lm loss: 7.148155E+00 | grad norm: 0.926 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1064/ 128728 | consumed samples: 17024 | consumed tokens: 34865152 | elapsed time per iteration (s): 15.14 | learning rate: 5.578E-06 | global batch size: 16 | lm loss: 7.344573E+00 | grad norm: 1.057 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 1065/ 128728 | consumed samples: 17040 | consumed tokens: 34897920 | elapsed time per iteration (s): 15.25 | learning rate: 5.584E-06 | global batch size: 16 | lm loss: 7.196020E+00 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1066/ 128728 | consumed samples: 17056 | consumed tokens: 34930688 | elapsed time per iteration (s): 15.26 | learning rate: 5.589E-06 | global batch size: 16 | lm loss: 7.104638E+00 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1067/ 128728 | consumed samples: 17072 | consumed tokens: 34963456 | elapsed time per iteration (s): 15.24 | learning rate: 5.594E-06 | global batch size: 16 | lm loss: 7.402941E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1068/ 128728 | consumed samples: 17088 | consumed tokens: 34996224 | elapsed time per iteration (s): 15.27 | learning rate: 5.599E-06 | global batch size: 16 | lm loss: 7.603527E+00 | grad norm: 1.426 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1069/ 128728 | consumed samples: 17104 | consumed tokens: 35028992 | elapsed time per iteration (s): 15.24 | learning rate: 5.605E-06 | global batch size: 16 | lm loss: 7.494851E+00 | grad norm: 1.085 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1070/ 128728 | consumed samples: 17120 | consumed tokens: 35061760 | elapsed time per iteration (s): 15.23 | learning rate: 
5.610E-06 | global batch size: 16 | lm loss: 7.395302E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1071/ 128728 | consumed samples: 17136 | consumed tokens: 35094528 | elapsed time per iteration (s): 15.29 | learning rate: 5.615E-06 | global batch size: 16 | lm loss: 7.198095E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 1072/ 128728 | consumed samples: 17152 | consumed tokens: 35127296 | elapsed time per iteration (s): 15.27 | learning rate: 5.620E-06 | global batch size: 16 | lm loss: 7.297481E+00 | grad norm: 1.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1073/ 128728 | consumed samples: 17168 | consumed tokens: 35160064 | elapsed time per iteration (s): 15.27 | learning rate: 5.626E-06 | global batch size: 16 | lm loss: 7.169433E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1074/ 128728 | consumed samples: 17184 | consumed tokens: 35192832 | elapsed time per iteration (s): 15.26 | learning rate: 5.631E-06 | global batch size: 16 | lm loss: 7.143753E+00 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1075/ 128728 | consumed samples: 17200 | consumed tokens: 35225600 | elapsed time per iteration (s): 15.24 | learning rate: 5.636E-06 | global batch size: 16 | lm loss: 7.086334E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1076/ 128728 | consumed 
samples: 17216 | consumed tokens: 35258368 | elapsed time per iteration (s): 15.28 | learning rate: 5.641E-06 | global batch size: 16 | lm loss: 7.248414E+00 | grad norm: 1.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 1077/ 128728 | consumed samples: 17232 | consumed tokens: 35291136 | elapsed time per iteration (s): 15.27 | learning rate: 5.647E-06 | global batch size: 16 | lm loss: 7.515269E+00 | grad norm: 2.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1078/ 128728 | consumed samples: 17248 | consumed tokens: 35323904 | elapsed time per iteration (s): 15.25 | learning rate: 5.652E-06 | global batch size: 16 | lm loss: 7.372351E+00 | grad norm: 2.026 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1079/ 128728 | consumed samples: 17264 | consumed tokens: 35356672 | elapsed time per iteration (s): 15.23 | learning rate: 5.657E-06 | global batch size: 16 | lm loss: 7.441353E+00 | grad norm: 2.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1080/ 128728 | consumed samples: 17280 | consumed tokens: 35389440 | elapsed time per iteration (s): 15.25 | learning rate: 5.662E-06 | global batch size: 16 | lm loss: 7.178278E+00 | grad norm: 1.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1081/ 128728 | consumed samples: 17296 | consumed tokens: 35422208 | elapsed time per iteration (s): 15.19 | learning rate: 5.668E-06 | global batch size: 16 | lm loss: 7.478823E+00 | grad norm: 1.421 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1082/ 128728 | consumed samples: 17312 | consumed tokens: 35454976 | elapsed time per iteration (s): 15.26 | learning rate: 5.673E-06 | global batch size: 16 | lm loss: 7.295471E+00 | grad norm: 1.048 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1083/ 128728 | consumed samples: 17328 | consumed tokens: 35487744 | elapsed time per iteration (s): 15.22 | learning rate: 5.678E-06 | global batch size: 16 | lm loss: 7.328071E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1084/ 128728 | consumed samples: 17344 | consumed tokens: 35520512 | elapsed time per iteration (s): 15.24 | learning rate: 5.683E-06 | global batch size: 16 | lm loss: 7.163485E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1085/ 128728 | consumed samples: 17360 | consumed tokens: 35553280 | elapsed time per iteration (s): 15.28 | learning rate: 5.689E-06 | global batch size: 16 | lm loss: 7.288455E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1086/ 128728 | consumed samples: 17376 | consumed tokens: 35586048 | elapsed time per iteration (s): 15.24 | learning rate: 5.694E-06 | global batch size: 16 | lm loss: 7.212840E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1087/ 128728 | consumed samples: 17392 | consumed tokens: 35618816 | elapsed time per iteration (s): 15.24 | learning rate: 5.699E-06 | global batch size: 16 | lm loss: 
7.166890E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1088/ 128728 | consumed samples: 17408 | consumed tokens: 35651584 | elapsed time per iteration (s): 15.24 | learning rate: 5.704E-06 | global batch size: 16 | lm loss: 7.437174E+00 | grad norm: 1.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1089/ 128728 | consumed samples: 17424 | consumed tokens: 35684352 | elapsed time per iteration (s): 15.27 | learning rate: 5.710E-06 | global batch size: 16 | lm loss: 7.178500E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1090/ 128728 | consumed samples: 17440 | consumed tokens: 35717120 | elapsed time per iteration (s): 15.24 | learning rate: 5.715E-06 | global batch size: 16 | lm loss: 7.343741E+00 | grad norm: 1.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1091/ 128728 | consumed samples: 17456 | consumed tokens: 35749888 | elapsed time per iteration (s): 15.26 | learning rate: 5.720E-06 | global batch size: 16 | lm loss: 7.443361E+00 | grad norm: 1.581 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1092/ 128728 | consumed samples: 17472 | consumed tokens: 35782656 | elapsed time per iteration (s): 15.23 | learning rate: 5.725E-06 | global batch size: 16 | lm loss: 7.196815E+00 | grad norm: 1.375 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1093/ 128728 | consumed samples: 17488 | consumed tokens: 35815424 | 
elapsed time per iteration (s): 15.19 | learning rate: 5.730E-06 | global batch size: 16 | lm loss: 7.417691E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1094/ 128728 | consumed samples: 17504 | consumed tokens: 35848192 | elapsed time per iteration (s): 15.23 | learning rate: 5.736E-06 | global batch size: 16 | lm loss: 7.217441E+00 | grad norm: 1.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1095/ 128728 | consumed samples: 17520 | consumed tokens: 35880960 | elapsed time per iteration (s): 15.26 | learning rate: 5.741E-06 | global batch size: 16 | lm loss: 7.141168E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1096/ 128728 | consumed samples: 17536 | consumed tokens: 35913728 | elapsed time per iteration (s): 15.23 | learning rate: 5.746E-06 | global batch size: 16 | lm loss: 7.413390E+00 | grad norm: 1.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1097/ 128728 | consumed samples: 17552 | consumed tokens: 35946496 | elapsed time per iteration (s): 15.23 | learning rate: 5.751E-06 | global batch size: 16 | lm loss: 7.284686E+00 | grad norm: 1.047 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1098/ 128728 | consumed samples: 17568 | consumed tokens: 35979264 | elapsed time per iteration (s): 15.25 | learning rate: 5.757E-06 | global batch size: 16 | lm loss: 7.118299E+00 | grad norm: 1.404 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 
8.03 | [default7]: iteration 1099/ 128728 | consumed samples: 17584 | consumed tokens: 36012032 | elapsed time per iteration (s): 15.26 | learning rate: 5.762E-06 | global batch size: 16 | lm loss: 7.185723E+00 | grad norm: 1.040 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1100/ 128728 | consumed samples: 17600 | consumed tokens: 36044800 | elapsed time per iteration (s): 15.24 | learning rate: 5.767E-06 | global batch size: 16 | lm loss: 7.335216E+00 | grad norm: 1.337 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1101/ 128728 | consumed samples: 17616 | consumed tokens: 36077568 | elapsed time per iteration (s): 15.21 | learning rate: 5.772E-06 | global batch size: 16 | lm loss: 7.115668E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1102/ 128728 | consumed samples: 17632 | consumed tokens: 36110336 | elapsed time per iteration (s): 15.23 | learning rate: 5.778E-06 | global batch size: 16 | lm loss: 7.229290E+00 | grad norm: 1.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1103/ 128728 | consumed samples: 17648 | consumed tokens: 36143104 | elapsed time per iteration (s): 15.21 | learning rate: 5.783E-06 | global batch size: 16 | lm loss: 7.195288E+00 | grad norm: 0.993 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1104/ 128728 | consumed samples: 17664 | consumed tokens: 36175872 | elapsed time per iteration (s): 15.24 | learning rate: 5.788E-06 | global batch size: 16 | lm loss: 7.160654E+00 | grad norm: 1.414 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1105/ 128728 | consumed samples: 17680 | consumed tokens: 36208640 | elapsed time per iteration (s): 15.26 | learning rate: 5.793E-06 | global batch size: 16 | lm loss: 7.244509E+00 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1106/ 128728 | consumed samples: 17696 | consumed tokens: 36241408 | elapsed time per iteration (s): 15.25 | learning rate: 5.799E-06 | global batch size: 16 | lm loss: 7.244285E+00 | grad norm: 1.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1107/ 128728 | consumed samples: 17712 | consumed tokens: 36274176 | elapsed time per iteration (s): 15.25 | learning rate: 5.804E-06 | global batch size: 16 | lm loss: 7.213204E+00 | grad norm: 0.970 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1108/ 128728 | consumed samples: 17728 | consumed tokens: 36306944 | elapsed time per iteration (s): 15.25 | learning rate: 5.809E-06 | global batch size: 16 | lm loss: 7.250452E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1109/ 128728 | consumed samples: 17744 | consumed tokens: 36339712 | elapsed time per iteration (s): 15.22 | learning rate: 5.814E-06 | global batch size: 16 | lm loss: 7.291537E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1110/ 128728 | consumed samples: 17760 | consumed tokens: 36372480 | elapsed time per iteration (s): 15.21 | learning rate: 
5.820E-06 | global batch size: 16 | lm loss: 7.145199E+00 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1111/ 128728 | consumed samples: 17776 | consumed tokens: 36405248 | elapsed time per iteration (s): 15.25 | learning rate: 5.825E-06 | global batch size: 16 | lm loss: 7.345960E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1112/ 128728 | consumed samples: 17792 | consumed tokens: 36438016 | elapsed time per iteration (s): 15.22 | learning rate: 5.830E-06 | global batch size: 16 | lm loss: 7.107178E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1113/ 128728 | consumed samples: 17808 | consumed tokens: 36470784 | elapsed time per iteration (s): 15.23 | learning rate: 5.835E-06 | global batch size: 16 | lm loss: 6.999576E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1114/ 128728 | consumed samples: 17824 | consumed tokens: 36503552 | elapsed time per iteration (s): 15.23 | learning rate: 5.841E-06 | global batch size: 16 | lm loss: 7.287607E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1115/ 128728 | consumed samples: 17840 | consumed tokens: 36536320 | elapsed time per iteration (s): 15.24 | learning rate: 5.846E-06 | global batch size: 16 | lm loss: 7.054477E+00 | grad norm: 1.085 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1116/ 128728 | consumed 
samples: 17856 | consumed tokens: 36569088 | elapsed time per iteration (s): 15.23 | learning rate: 5.851E-06 | global batch size: 16 | lm loss: 7.217619E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1117/ 128728 | consumed samples: 17872 | consumed tokens: 36601856 | elapsed time per iteration (s): 15.23 | learning rate: 5.856E-06 | global batch size: 16 | lm loss: 7.185878E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1118/ 128728 | consumed samples: 17888 | consumed tokens: 36634624 | elapsed time per iteration (s): 15.21 | learning rate: 5.862E-06 | global batch size: 16 | lm loss: 7.304596E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1119/ 128728 | consumed samples: 17904 | consumed tokens: 36667392 | elapsed time per iteration (s): 15.22 | learning rate: 5.867E-06 | global batch size: 16 | lm loss: 7.287797E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1120/ 128728 | consumed samples: 17920 | consumed tokens: 36700160 | elapsed time per iteration (s): 15.26 | learning rate: 5.872E-06 | global batch size: 16 | lm loss: 7.236816E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1121/ 128728 | consumed samples: 17936 | consumed tokens: 36732928 | elapsed time per iteration (s): 15.22 | learning rate: 5.877E-06 | global batch size: 16 | lm loss: 7.148897E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1122/ 128728 | consumed samples: 17952 | consumed tokens: 36765696 | elapsed time per iteration (s): 15.23 | learning rate: 5.883E-06 | global batch size: 16 | lm loss: 7.309883E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1123/ 128728 | consumed samples: 17968 | consumed tokens: 36798464 | elapsed time per iteration (s): 15.18 | learning rate: 5.888E-06 | global batch size: 16 | lm loss: 7.121294E+00 | grad norm: 1.055 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1124/ 128728 | consumed samples: 17984 | consumed tokens: 36831232 | elapsed time per iteration (s): 15.23 | learning rate: 5.893E-06 | global batch size: 16 | lm loss: 7.235108E+00 | grad norm: 0.990 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1125/ 128728 | consumed samples: 18000 | consumed tokens: 36864000 | elapsed time per iteration (s): 15.22 | learning rate: 5.898E-06 | global batch size: 16 | lm loss: 7.221193E+00 | grad norm: 1.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1126/ 128728 | consumed samples: 18016 | consumed tokens: 36896768 | elapsed time per iteration (s): 15.24 | learning rate: 5.903E-06 | global batch size: 16 | lm loss: 7.522739E+00 | grad norm: 1.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1127/ 128728 | consumed samples: 18032 | consumed tokens: 36929536 | elapsed time per iteration (s): 15.25 | learning rate: 5.909E-06 | global batch size: 16 | lm loss: 
7.258095E+00 | grad norm: 1.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1128/ 128728 | consumed samples: 18048 | consumed tokens: 36962304 | elapsed time per iteration (s): 15.24 | learning rate: 5.914E-06 | global batch size: 16 | lm loss: 7.177681E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1129/ 128728 | consumed samples: 18064 | consumed tokens: 36995072 | elapsed time per iteration (s): 15.23 | learning rate: 5.919E-06 | global batch size: 16 | lm loss: 7.164636E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1130/ 128728 | consumed samples: 18080 | consumed tokens: 37027840 | elapsed time per iteration (s): 15.23 | learning rate: 5.924E-06 | global batch size: 16 | lm loss: 6.921859E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1131/ 128728 | consumed samples: 18096 | consumed tokens: 37060608 | elapsed time per iteration (s): 15.23 | learning rate: 5.930E-06 | global batch size: 16 | lm loss: 6.996799E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1132/ 128728 | consumed samples: 18112 | consumed tokens: 37093376 | elapsed time per iteration (s): 15.26 | learning rate: 5.935E-06 | global batch size: 16 | lm loss: 7.323952E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1133/ 128728 | consumed samples: 18128 | consumed tokens: 37126144 | 
elapsed time per iteration (s): 15.22 | learning rate: 5.940E-06 | global batch size: 16 | lm loss: 7.006363E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1134/ 128728 | consumed samples: 18144 | consumed tokens: 37158912 | elapsed time per iteration (s): 15.26 | learning rate: 5.945E-06 | global batch size: 16 | lm loss: 7.190140E+00 | grad norm: 1.403 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1135/ 128728 | consumed samples: 18160 | consumed tokens: 37191680 | elapsed time per iteration (s): 15.27 | learning rate: 5.951E-06 | global batch size: 16 | lm loss: 7.225429E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1136/ 128728 | consumed samples: 18176 | consumed tokens: 37224448 | elapsed time per iteration (s): 15.17 | learning rate: 5.956E-06 | global batch size: 16 | lm loss: 7.188299E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1137/ 128728 | consumed samples: 18192 | consumed tokens: 37257216 | elapsed time per iteration (s): 15.20 | learning rate: 5.961E-06 | global batch size: 16 | lm loss: 7.277708E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1138/ 128728 | consumed samples: 18208 | consumed tokens: 37289984 | elapsed time per iteration (s): 15.23 | learning rate: 5.966E-06 | global batch size: 16 | lm loss: 7.208605E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1139/ 128728 | consumed samples: 18224 | consumed tokens: 37322752 | elapsed time per iteration (s): 15.24 | learning rate: 5.972E-06 | global batch size: 16 | lm loss: 7.097051E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1140/ 128728 | consumed samples: 18240 | consumed tokens: 37355520 | elapsed time per iteration (s): 15.22 | learning rate: 5.977E-06 | global batch size: 16 | lm loss: 7.225067E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1141/ 128728 | consumed samples: 18256 | consumed tokens: 37388288 | elapsed time per iteration (s): 15.21 | learning rate: 5.982E-06 | global batch size: 16 | lm loss: 7.149609E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1142/ 128728 | consumed samples: 18272 | consumed tokens: 37421056 | elapsed time per iteration (s): 15.22 | learning rate: 5.987E-06 | global batch size: 16 | lm loss: 7.092099E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1143/ 128728 | consumed samples: 18288 | consumed tokens: 37453824 | elapsed time per iteration (s): 15.27 | learning rate: 5.993E-06 | global batch size: 16 | lm loss: 7.053136E+00 | grad norm: 1.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1144/ 128728 | consumed samples: 18304 | consumed tokens: 37486592 | elapsed time per iteration (s): 15.25 | learning rate: 5.998E-06 | global batch size: 16 | lm loss: 7.427276E+00 | grad norm: 0.961 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1145/ 128728 | consumed samples: 18320 | consumed tokens: 37519360 | elapsed time per iteration (s): 15.17 | learning rate: 6.003E-06 | global batch size: 16 | lm loss: 7.303183E+00 | grad norm: 1.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 1146/ 128728 | consumed samples: 18336 | consumed tokens: 37552128 | elapsed time per iteration (s): 15.22 | learning rate: 6.008E-06 | global batch size: 16 | lm loss: 7.172232E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1147/ 128728 | consumed samples: 18352 | consumed tokens: 37584896 | elapsed time per iteration (s): 15.25 | learning rate: 6.014E-06 | global batch size: 16 | lm loss: 7.312620E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1148/ 128728 | consumed samples: 18368 | consumed tokens: 37617664 | elapsed time per iteration (s): 15.23 | learning rate: 6.019E-06 | global batch size: 16 | lm loss: 7.070820E+00 | grad norm: 1.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1149/ 128728 | consumed samples: 18384 | consumed tokens: 37650432 | elapsed time per iteration (s): 15.16 | learning rate: 6.024E-06 | global batch size: 16 | lm loss: 7.238826E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1150/ 128728 | consumed samples: 18400 | consumed tokens: 37683200 | elapsed time per iteration (s): 15.22 | learning rate: 
6.029E-06 | global batch size: 16 | lm loss: 7.159998E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1151/ 128728 | consumed samples: 18416 | consumed tokens: 37715968 | elapsed time per iteration (s): 15.23 | learning rate: 6.035E-06 | global batch size: 16 | lm loss: 7.089765E+00 | grad norm: 1.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1152/ 128728 | consumed samples: 18432 | consumed tokens: 37748736 | elapsed time per iteration (s): 15.23 | learning rate: 6.040E-06 | global batch size: 16 | lm loss: 7.016187E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1153/ 128728 | consumed samples: 18448 | consumed tokens: 37781504 | elapsed time per iteration (s): 15.19 | learning rate: 6.045E-06 | global batch size: 16 | lm loss: 7.231027E+00 | grad norm: 1.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1154/ 128728 | consumed samples: 18464 | consumed tokens: 37814272 | elapsed time per iteration (s): 15.23 | learning rate: 6.050E-06 | global batch size: 16 | lm loss: 7.197011E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1155/ 128728 | consumed samples: 18480 | consumed tokens: 37847040 | elapsed time per iteration (s): 15.26 | learning rate: 6.056E-06 | global batch size: 16 | lm loss: 7.368340E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1156/ 128728 | consumed 
samples: 18496 | consumed tokens: 37879808 | elapsed time per iteration (s): 15.20 | learning rate: 6.061E-06 | global batch size: 16 | lm loss: 7.069404E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1157/ 128728 | consumed samples: 18512 | consumed tokens: 37912576 | elapsed time per iteration (s): 15.22 | learning rate: 6.066E-06 | global batch size: 16 | lm loss: 7.192194E+00 | grad norm: 1.085 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1158/ 128728 | consumed samples: 18528 | consumed tokens: 37945344 | elapsed time per iteration (s): 15.22 | learning rate: 6.071E-06 | global batch size: 16 | lm loss: 7.340763E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1159/ 128728 | consumed samples: 18544 | consumed tokens: 37978112 | elapsed time per iteration (s): 15.25 | learning rate: 6.077E-06 | global batch size: 16 | lm loss: 6.942504E+00 | grad norm: 1.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1160/ 128728 | consumed samples: 18560 | consumed tokens: 38010880 | elapsed time per iteration (s): 15.26 | learning rate: 6.082E-06 | global batch size: 16 | lm loss: 7.018706E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1161/ 128728 | consumed samples: 18576 | consumed tokens: 38043648 | elapsed time per iteration (s): 15.21 | learning rate: 6.087E-06 | global batch size: 16 | lm loss: 7.082819E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1162/ 128728 | consumed samples: 18592 | consumed tokens: 38076416 | elapsed time per iteration (s): 15.24 | learning rate: 6.092E-06 | global batch size: 16 | lm loss: 7.236361E+00 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1163/ 128728 | consumed samples: 18608 | consumed tokens: 38109184 | elapsed time per iteration (s): 15.20 | learning rate: 6.097E-06 | global batch size: 16 | lm loss: 7.258739E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1164/ 128728 | consumed samples: 18624 | consumed tokens: 38141952 | elapsed time per iteration (s): 15.25 | learning rate: 6.103E-06 | global batch size: 16 | lm loss: 6.894892E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1165/ 128728 | consumed samples: 18640 | consumed tokens: 38174720 | elapsed time per iteration (s): 15.26 | learning rate: 6.108E-06 | global batch size: 16 | lm loss: 7.280957E+00 | grad norm: 1.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1166/ 128728 | consumed samples: 18656 | consumed tokens: 38207488 | elapsed time per iteration (s): 15.21 | learning rate: 6.113E-06 | global batch size: 16 | lm loss: 7.098267E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1167/ 128728 | consumed samples: 18672 | consumed tokens: 38240256 | elapsed time per iteration (s): 15.30 | learning rate: 6.118E-06 | global batch size: 16 | lm loss: 
7.147165E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1168/ 128728 | consumed samples: 18688 | consumed tokens: 38273024 | elapsed time per iteration (s): 15.23 | learning rate: 6.124E-06 | global batch size: 16 | lm loss: 7.112779E+00 | grad norm: 1.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1169/ 128728 | consumed samples: 18704 | consumed tokens: 38305792 | elapsed time per iteration (s): 15.26 | learning rate: 6.129E-06 | global batch size: 16 | lm loss: 7.251498E+00 | grad norm: 1.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1170/ 128728 | consumed samples: 18720 | consumed tokens: 38338560 | elapsed time per iteration (s): 15.27 | learning rate: 6.134E-06 | global batch size: 16 | lm loss: 7.245819E+00 | grad norm: 1.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1171/ 128728 | consumed samples: 18736 | consumed tokens: 38371328 | elapsed time per iteration (s): 15.21 | learning rate: 6.139E-06 | global batch size: 16 | lm loss: 7.118947E+00 | grad norm: 1.379 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1172/ 128728 | consumed samples: 18752 | consumed tokens: 38404096 | elapsed time per iteration (s): 15.22 | learning rate: 6.145E-06 | global batch size: 16 | lm loss: 7.312955E+00 | grad norm: 1.041 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1173/ 128728 | consumed samples: 18768 | consumed tokens: 38436864 | 
elapsed time per iteration (s): 15.21 | learning rate: 6.150E-06 | global batch size: 16 | lm loss: 7.203588E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1174/ 128728 | consumed samples: 18784 | consumed tokens: 38469632 | elapsed time per iteration (s): 15.16 | learning rate: 6.155E-06 | global batch size: 16 | lm loss: 7.083356E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1175/ 128728 | consumed samples: 18800 | consumed tokens: 38502400 | elapsed time per iteration (s): 15.23 | learning rate: 6.160E-06 | global batch size: 16 | lm loss: 7.164299E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1176/ 128728 | consumed samples: 18816 | consumed tokens: 38535168 | elapsed time per iteration (s): 15.25 | learning rate: 6.166E-06 | global batch size: 16 | lm loss: 7.204933E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1177/ 128728 | consumed samples: 18832 | consumed tokens: 38567936 | elapsed time per iteration (s): 15.22 | learning rate: 6.171E-06 | global batch size: 16 | lm loss: 7.019668E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1178/ 128728 | consumed samples: 18848 | consumed tokens: 38600704 | elapsed time per iteration (s): 15.23 | learning rate: 6.176E-06 | global batch size: 16 | lm loss: 7.238056E+00 | grad norm: 1.089 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1179/ 128728 | consumed samples: 18864 | consumed tokens: 38633472 | elapsed time per iteration (s): 15.25 | learning rate: 6.181E-06 | global batch size: 16 | lm loss: 7.101101E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1180/ 128728 | consumed samples: 18880 | consumed tokens: 38666240 | elapsed time per iteration (s): 15.21 | learning rate: 6.187E-06 | global batch size: 16 | lm loss: 7.030687E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1181/ 128728 | consumed samples: 18896 | consumed tokens: 38699008 | elapsed time per iteration (s): 15.25 | learning rate: 6.192E-06 | global batch size: 16 | lm loss: 7.330659E+00 | grad norm: 1.406 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1182/ 128728 | consumed samples: 18912 | consumed tokens: 38731776 | elapsed time per iteration (s): 15.24 | learning rate: 6.197E-06 | global batch size: 16 | lm loss: 7.227168E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1183/ 128728 | consumed samples: 18928 | consumed tokens: 38764544 | elapsed time per iteration (s): 15.25 | learning rate: 6.202E-06 | global batch size: 16 | lm loss: 7.105655E+00 | grad norm: 1.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1184/ 128728 | consumed samples: 18944 | consumed tokens: 38797312 | elapsed time per iteration (s): 15.19 | learning rate: 6.208E-06 | global batch size: 16 | lm loss: 7.421823E+00 | grad norm: 1.126 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1185/ 128728 | consumed samples: 18960 | consumed tokens: 38830080 | elapsed time per iteration (s): 15.11 | learning rate: 6.213E-06 | global batch size: 16 | lm loss: 7.161137E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.059 | TFLOPs: 8.11 | [default7]: iteration 1186/ 128728 | consumed samples: 18976 | consumed tokens: 38862848 | elapsed time per iteration (s): 15.24 | learning rate: 6.218E-06 | global batch size: 16 | lm loss: 7.420480E+00 | grad norm: 1.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1187/ 128728 | consumed samples: 18992 | consumed tokens: 38895616 | elapsed time per iteration (s): 15.26 | learning rate: 6.223E-06 | global batch size: 16 | lm loss: 7.459645E+00 | grad norm: 1.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1188/ 128728 | consumed samples: 19008 | consumed tokens: 38928384 | elapsed time per iteration (s): 15.24 | learning rate: 6.229E-06 | global batch size: 16 | lm loss: 7.134075E+00 | grad norm: 1.416 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1189/ 128728 | consumed samples: 19024 | consumed tokens: 38961152 | elapsed time per iteration (s): 15.23 | learning rate: 6.234E-06 | global batch size: 16 | lm loss: 7.168115E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1190/ 128728 | consumed samples: 19040 | consumed tokens: 38993920 | elapsed time per iteration (s): 15.20 | learning rate: 
6.239E-06 | global batch size: 16 | lm loss: 7.134392E+00 | grad norm: 1.056 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1191/ 128728 | consumed samples: 19056 | consumed tokens: 39026688 | elapsed time per iteration (s): 15.18 | learning rate: 6.244E-06 | global batch size: 16 | lm loss: 7.327762E+00 | grad norm: 1.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1192/ 128728 | consumed samples: 19072 | consumed tokens: 39059456 | elapsed time per iteration (s): 15.27 | learning rate: 6.250E-06 | global batch size: 16 | lm loss: 7.085316E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1193/ 128728 | consumed samples: 19088 | consumed tokens: 39092224 | elapsed time per iteration (s): 15.24 | learning rate: 6.255E-06 | global batch size: 16 | lm loss: 7.026468E+00 | grad norm: 1.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1194/ 128728 | consumed samples: 19104 | consumed tokens: 39124992 | elapsed time per iteration (s): 15.26 | learning rate: 6.260E-06 | global batch size: 16 | lm loss: 7.376468E+00 | grad norm: 1.375 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1195/ 128728 | consumed samples: 19120 | consumed tokens: 39157760 | elapsed time per iteration (s): 15.25 | learning rate: 6.265E-06 | global batch size: 16 | lm loss: 7.219844E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1196/ 128728 | consumed 
samples: 19136 | consumed tokens: 39190528 | elapsed time per iteration (s): 15.25 | learning rate: 6.271E-06 | global batch size: 16 | lm loss: 7.149906E+00 | grad norm: 1.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1197/ 128728 | consumed samples: 19152 | consumed tokens: 39223296 | elapsed time per iteration (s): 15.22 | learning rate: 6.276E-06 | global batch size: 16 | lm loss: 6.934923E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1198/ 128728 | consumed samples: 19168 | consumed tokens: 39256064 | elapsed time per iteration (s): 15.22 | learning rate: 6.281E-06 | global batch size: 16 | lm loss: 6.979043E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1199/ 128728 | consumed samples: 19184 | consumed tokens: 39288832 | elapsed time per iteration (s): 15.23 | learning rate: 6.286E-06 | global batch size: 16 | lm loss: 7.078469E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1200/ 128728 | consumed samples: 19200 | consumed tokens: 39321600 | elapsed time per iteration (s): 15.25 | learning rate: 6.291E-06 | global batch size: 16 | lm loss: 7.111989E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1201/ 128728 | consumed samples: 19216 | consumed tokens: 39354368 | elapsed time per iteration (s): 15.23 | learning rate: 6.297E-06 | global batch size: 16 | lm loss: 7.255686E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1202/ 128728 | consumed samples: 19232 | consumed tokens: 39387136 | elapsed time per iteration (s): 15.23 | learning rate: 6.302E-06 | global batch size: 16 | lm loss: 7.404012E+00 | grad norm: 1.110 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1203/ 128728 | consumed samples: 19248 | consumed tokens: 39419904 | elapsed time per iteration (s): 15.26 | learning rate: 6.307E-06 | global batch size: 16 | lm loss: 7.017631E+00 | grad norm: 1.472 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1204/ 128728 | consumed samples: 19264 | consumed tokens: 39452672 | elapsed time per iteration (s): 15.22 | learning rate: 6.312E-06 | global batch size: 16 | lm loss: 7.073680E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1205/ 128728 | consumed samples: 19280 | consumed tokens: 39485440 | elapsed time per iteration (s): 15.24 | learning rate: 6.318E-06 | global batch size: 16 | lm loss: 7.345861E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1206/ 128728 | consumed samples: 19296 | consumed tokens: 39518208 | elapsed time per iteration (s): 15.21 | learning rate: 6.323E-06 | global batch size: 16 | lm loss: 7.009941E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1207/ 128728 | consumed samples: 19312 | consumed tokens: 39550976 | elapsed time per iteration (s): 15.23 | learning rate: 6.328E-06 | global batch size: 16 | lm loss: 
7.123629E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1208/ 128728 | consumed samples: 19328 | consumed tokens: 39583744 | elapsed time per iteration (s): 15.17 | learning rate: 6.333E-06 | global batch size: 16 | lm loss: 7.077274E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1209/ 128728 | consumed samples: 19344 | consumed tokens: 39616512 | elapsed time per iteration (s): 15.20 | learning rate: 6.339E-06 | global batch size: 16 | lm loss: 7.096000E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1210/ 128728 | consumed samples: 19360 | consumed tokens: 39649280 | elapsed time per iteration (s): 15.23 | learning rate: 6.344E-06 | global batch size: 16 | lm loss: 7.476648E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1211/ 128728 | consumed samples: 19376 | consumed tokens: 39682048 | elapsed time per iteration (s): 15.22 | learning rate: 6.349E-06 | global batch size: 16 | lm loss: 6.972303E+00 | grad norm: 1.382 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1212/ 128728 | consumed samples: 19392 | consumed tokens: 39714816 | elapsed time per iteration (s): 15.24 | learning rate: 6.354E-06 | global batch size: 16 | lm loss: 7.088462E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1213/ 128728 | consumed samples: 19408 | consumed tokens: 39747584 | 
elapsed time per iteration (s): 15.25 | learning rate: 6.360E-06 | global batch size: 16 | lm loss: 7.357036E+00 | grad norm: 1.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1214/ 128728 | consumed samples: 19424 | consumed tokens: 39780352 | elapsed time per iteration (s): 15.24 | learning rate: 6.365E-06 | global batch size: 16 | lm loss: 7.337027E+00 | grad norm: 1.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1215/ 128728 | consumed samples: 19440 | consumed tokens: 39813120 | elapsed time per iteration (s): 15.24 | learning rate: 6.370E-06 | global batch size: 16 | lm loss: 6.935066E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1216/ 128728 | consumed samples: 19456 | consumed tokens: 39845888 | elapsed time per iteration (s): 15.22 | learning rate: 6.375E-06 | global batch size: 16 | lm loss: 7.197056E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1217/ 128728 | consumed samples: 19472 | consumed tokens: 39878656 | elapsed time per iteration (s): 15.24 | learning rate: 6.381E-06 | global batch size: 16 | lm loss: 7.179683E+00 | grad norm: 1.091 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1218/ 128728 | consumed samples: 19488 | consumed tokens: 39911424 | elapsed time per iteration (s): 15.22 | learning rate: 6.386E-06 | global batch size: 16 | lm loss: 7.041315E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1219/ 128728 | consumed samples: 19504 | consumed tokens: 39944192 | elapsed time per iteration (s): 15.19 | learning rate: 6.391E-06 | global batch size: 16 | lm loss: 7.058975E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 1220/ 128728 | consumed samples: 19520 | consumed tokens: 39976960 | elapsed time per iteration (s): 15.25 | learning rate: 6.396E-06 | global batch size: 16 | lm loss: 7.103866E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1221/ 128728 | consumed samples: 19536 | consumed tokens: 40009728 | elapsed time per iteration (s): 15.22 | learning rate: 6.402E-06 | global batch size: 16 | lm loss: 7.216382E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1222/ 128728 | consumed samples: 19552 | consumed tokens: 40042496 | elapsed time per iteration (s): 15.24 | learning rate: 6.407E-06 | global batch size: 16 | lm loss: 6.964835E+00 | grad norm: 0.956 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1223/ 128728 | consumed samples: 19568 | consumed tokens: 40075264 | elapsed time per iteration (s): 15.19 | learning rate: 6.412E-06 | global batch size: 16 | lm loss: 6.933653E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1224/ 128728 | consumed samples: 19584 | consumed tokens: 40108032 | elapsed time per iteration (s): 15.26 | learning rate: 6.417E-06 | global batch size: 16 | lm loss: 7.316370E+00 | grad norm: 1.229 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1225/ 128728 | consumed samples: 19600 | consumed tokens: 40140800 | elapsed time per iteration (s): 15.22 | learning rate: 6.423E-06 | global batch size: 16 | lm loss: 7.115275E+00 | grad norm: 1.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1226/ 128728 | consumed samples: 19616 | consumed tokens: 40173568 | elapsed time per iteration (s): 15.25 | learning rate: 6.428E-06 | global batch size: 16 | lm loss: 7.078380E+00 | grad norm: 1.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1227/ 128728 | consumed samples: 19632 | consumed tokens: 40206336 | elapsed time per iteration (s): 15.32 | learning rate: 6.433E-06 | global batch size: 16 | lm loss: 7.140039E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 1228/ 128728 | consumed samples: 19648 | consumed tokens: 40239104 | elapsed time per iteration (s): 15.24 | learning rate: 6.438E-06 | global batch size: 16 | lm loss: 6.979059E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1229/ 128728 | consumed samples: 19664 | consumed tokens: 40271872 | elapsed time per iteration (s): 15.24 | learning rate: 6.444E-06 | global batch size: 16 | lm loss: 7.118724E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1230/ 128728 | consumed samples: 19680 | consumed tokens: 40304640 | elapsed time per iteration (s): 15.24 | learning rate: 
6.449E-06 | global batch size: 16 | lm loss: 7.120239E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1231/ 128728 | consumed samples: 19696 | consumed tokens: 40337408 | elapsed time per iteration (s): 15.18 | learning rate: 6.454E-06 | global batch size: 16 | lm loss: 7.180079E+00 | grad norm: 1.100 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1232/ 128728 | consumed samples: 19712 | consumed tokens: 40370176 | elapsed time per iteration (s): 15.25 | learning rate: 6.459E-06 | global batch size: 16 | lm loss: 7.335692E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1233/ 128728 | consumed samples: 19728 | consumed tokens: 40402944 | elapsed time per iteration (s): 15.23 | learning rate: 6.464E-06 | global batch size: 16 | lm loss: 7.010607E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1234/ 128728 | consumed samples: 19744 | consumed tokens: 40435712 | elapsed time per iteration (s): 15.20 | learning rate: 6.470E-06 | global batch size: 16 | lm loss: 6.938548E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1235/ 128728 | consumed samples: 19760 | consumed tokens: 40468480 | elapsed time per iteration (s): 15.27 | learning rate: 6.475E-06 | global batch size: 16 | lm loss: 7.146415E+00 | grad norm: 1.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1236/ 128728 | consumed 
samples: 19776 | consumed tokens: 40501248 | elapsed time per iteration (s): 15.26 | learning rate: 6.480E-06 | global batch size: 16 | lm loss: 7.039947E+00 | grad norm: 1.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1237/ 128728 | consumed samples: 19792 | consumed tokens: 40534016 | elapsed time per iteration (s): 15.27 | learning rate: 6.485E-06 | global batch size: 16 | lm loss: 7.084141E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1238/ 128728 | consumed samples: 19808 | consumed tokens: 40566784 | elapsed time per iteration (s): 15.27 | learning rate: 6.491E-06 | global batch size: 16 | lm loss: 7.073313E+00 | grad norm: 1.034 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1239/ 128728 | consumed samples: 19824 | consumed tokens: 40599552 | elapsed time per iteration (s): 15.24 | learning rate: 6.496E-06 | global batch size: 16 | lm loss: 6.969284E+00 | grad norm: 1.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1240/ 128728 | consumed samples: 19840 | consumed tokens: 40632320 | elapsed time per iteration (s): 15.20 | learning rate: 6.501E-06 | global batch size: 16 | lm loss: 7.203765E+00 | grad norm: 1.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1241/ 128728 | consumed samples: 19856 | consumed tokens: 40665088 | elapsed time per iteration (s): 15.24 | learning rate: 6.506E-06 | global batch size: 16 | lm loss: 7.026887E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1242/ 128728 | consumed samples: 19872 | consumed tokens: 40697856 | elapsed time per iteration (s): 15.25 | learning rate: 6.512E-06 | global batch size: 16 | lm loss: 7.141012E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1243/ 128728 | consumed samples: 19888 | consumed tokens: 40730624 | elapsed time per iteration (s): 15.27 | learning rate: 6.517E-06 | global batch size: 16 | lm loss: 6.841239E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1244/ 128728 | consumed samples: 19904 | consumed tokens: 40763392 | elapsed time per iteration (s): 15.64 | learning rate: 6.522E-06 | global batch size: 16 | lm loss: 6.917506E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.023 | TFLOPs: 7.83 | [default7]: iteration 1245/ 128728 | consumed samples: 19920 | consumed tokens: 40796160 | elapsed time per iteration (s): 19.06 | learning rate: 6.527E-06 | global batch size: 16 | lm loss: 7.028550E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.840 | TFLOPs: 6.43 | [default7]: iteration 1246/ 128728 | consumed samples: 19936 | consumed tokens: 40828928 | elapsed time per iteration (s): 17.59 | learning rate: 6.533E-06 | global batch size: 16 | lm loss: 7.041822E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.910 | TFLOPs: 6.96 | [default7]: iteration 1247/ 128728 | consumed samples: 19952 | consumed tokens: 40861696 | elapsed time per iteration (s): 18.59 | learning rate: 6.538E-06 | global batch size: 16 | lm loss: 
6.829185E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.861 | TFLOPs: 6.59 | [default7]: iteration 1248/ 128728 | consumed samples: 19968 | consumed tokens: 40894464 | elapsed time per iteration (s): 23.66 | learning rate: 6.543E-06 | global batch size: 16 | lm loss: 7.007943E+00 | grad norm: 1.088 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.676 | TFLOPs: 5.18 | [default7]: iteration 1249/ 128728 | consumed samples: 19984 | consumed tokens: 40927232 | elapsed time per iteration (s): 16.26 | learning rate: 6.548E-06 | global batch size: 16 | lm loss: 7.074346E+00 | grad norm: 1.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.984 | TFLOPs: 7.53 | [default7]: iteration 1250/ 128728 | consumed samples: 20000 | consumed tokens: 40960000 | elapsed time per iteration (s): 15.27 | learning rate: 6.554E-06 | global batch size: 16 | lm loss: 7.107431E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1251/ 128728 | consumed samples: 20016 | consumed tokens: 40992768 | elapsed time per iteration (s): 15.25 | learning rate: 6.559E-06 | global batch size: 16 | lm loss: 6.935212E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1252/ 128728 | consumed samples: 20032 | consumed tokens: 41025536 | elapsed time per iteration (s): 15.20 | learning rate: 6.564E-06 | global batch size: 16 | lm loss: 7.023438E+00 | grad norm: 0.908 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1253/ 128728 | consumed samples: 20048 | consumed tokens: 41058304 | 
elapsed time per iteration (s): 15.21 | learning rate: 6.569E-06 | global batch size: 16 | lm loss: 7.031582E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1254/ 128728 | consumed samples: 20064 | consumed tokens: 41091072 | elapsed time per iteration (s): 15.27 | learning rate: 6.575E-06 | global batch size: 16 | lm loss: 7.013303E+00 | grad norm: 1.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1255/ 128728 | consumed samples: 20080 | consumed tokens: 41123840 | elapsed time per iteration (s): 15.22 | learning rate: 6.580E-06 | global batch size: 16 | lm loss: 7.063211E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1256/ 128728 | consumed samples: 20096 | consumed tokens: 41156608 | elapsed time per iteration (s): 15.24 | learning rate: 6.585E-06 | global batch size: 16 | lm loss: 6.951248E+00 | grad norm: 0.999 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1257/ 128728 | consumed samples: 20112 | consumed tokens: 41189376 | elapsed time per iteration (s): 15.22 | learning rate: 6.590E-06 | global batch size: 16 | lm loss: 7.142652E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1258/ 128728 | consumed samples: 20128 | consumed tokens: 41222144 | elapsed time per iteration (s): 15.21 | learning rate: 6.596E-06 | global batch size: 16 | lm loss: 7.207096E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.06 | [default7]: iteration 1259/ 128728 | consumed samples: 20144 | consumed tokens: 41254912 | elapsed time per iteration (s): 15.25 | learning rate: 6.601E-06 | global batch size: 16 | lm loss: 7.017394E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1260/ 128728 | consumed samples: 20160 | consumed tokens: 41287680 | elapsed time per iteration (s): 15.25 | learning rate: 6.606E-06 | global batch size: 16 | lm loss: 7.106571E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1261/ 128728 | consumed samples: 20176 | consumed tokens: 41320448 | elapsed time per iteration (s): 15.26 | learning rate: 6.611E-06 | global batch size: 16 | lm loss: 7.189216E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1262/ 128728 | consumed samples: 20192 | consumed tokens: 41353216 | elapsed time per iteration (s): 15.21 | learning rate: 6.617E-06 | global batch size: 16 | lm loss: 7.057990E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1263/ 128728 | consumed samples: 20208 | consumed tokens: 41385984 | elapsed time per iteration (s): 15.23 | learning rate: 6.622E-06 | global batch size: 16 | lm loss: 7.032105E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1264/ 128728 | consumed samples: 20224 | consumed tokens: 41418752 | elapsed time per iteration (s): 15.24 | learning rate: 6.627E-06 | global batch size: 16 | lm loss: 7.253157E+00 | grad norm: 1.383 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1265/ 128728 | consumed samples: 20240 | consumed tokens: 41451520 | elapsed time per iteration (s): 15.25 | learning rate: 6.632E-06 | global batch size: 16 | lm loss: 6.969168E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1266/ 128728 | consumed samples: 20256 | consumed tokens: 41484288 | elapsed time per iteration (s): 15.23 | learning rate: 6.638E-06 | global batch size: 16 | lm loss: 7.029213E+00 | grad norm: 0.997 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1267/ 128728 | consumed samples: 20272 | consumed tokens: 41517056 | elapsed time per iteration (s): 15.27 | learning rate: 6.643E-06 | global batch size: 16 | lm loss: 7.001297E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1268/ 128728 | consumed samples: 20288 | consumed tokens: 41549824 | elapsed time per iteration (s): 15.25 | learning rate: 6.648E-06 | global batch size: 16 | lm loss: 7.099933E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1269/ 128728 | consumed samples: 20304 | consumed tokens: 41582592 | elapsed time per iteration (s): 15.28 | learning rate: 6.653E-06 | global batch size: 16 | lm loss: 6.980592E+00 | grad norm: 1.030 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 1270/ 128728 | consumed samples: 20320 | consumed tokens: 41615360 | elapsed time per iteration (s): 15.22 | learning rate: 
6.658E-06 | global batch size: 16 | lm loss: 7.266665E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1271/ 128728 | consumed samples: 20336 | consumed tokens: 41648128 | elapsed time per iteration (s): 15.24 | learning rate: 6.664E-06 | global batch size: 16 | lm loss: 6.836196E+00 | grad norm: 0.993 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1272/ 128728 | consumed samples: 20352 | consumed tokens: 41680896 | elapsed time per iteration (s): 15.23 | learning rate: 6.669E-06 | global batch size: 16 | lm loss: 7.449065E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1273/ 128728 | consumed samples: 20368 | consumed tokens: 41713664 | elapsed time per iteration (s): 15.24 | learning rate: 6.674E-06 | global batch size: 16 | lm loss: 7.271956E+00 | grad norm: 1.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1274/ 128728 | consumed samples: 20384 | consumed tokens: 41746432 | elapsed time per iteration (s): 15.24 | learning rate: 6.679E-06 | global batch size: 16 | lm loss: 7.223175E+00 | grad norm: 1.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1275/ 128728 | consumed samples: 20400 | consumed tokens: 41779200 | elapsed time per iteration (s): 15.25 | learning rate: 6.685E-06 | global batch size: 16 | lm loss: 7.255591E+00 | grad norm: 1.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1276/ 128728 | consumed 
samples: 20416 | consumed tokens: 41811968 | elapsed time per iteration (s): 15.24 | learning rate: 6.690E-06 | global batch size: 16 | lm loss: 7.017190E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1277/ 128728 | consumed samples: 20432 | consumed tokens: 41844736 | elapsed time per iteration (s): 15.23 | learning rate: 6.695E-06 | global batch size: 16 | lm loss: 7.104808E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1278/ 128728 | consumed samples: 20448 | consumed tokens: 41877504 | elapsed time per iteration (s): 15.23 | learning rate: 6.700E-06 | global batch size: 16 | lm loss: 7.052327E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1279/ 128728 | consumed samples: 20464 | consumed tokens: 41910272 | elapsed time per iteration (s): 15.24 | learning rate: 6.706E-06 | global batch size: 16 | lm loss: 7.316154E+00 | grad norm: 1.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1280/ 128728 | consumed samples: 20480 | consumed tokens: 41943040 | elapsed time per iteration (s): 15.25 | learning rate: 6.711E-06 | global batch size: 16 | lm loss: 7.109064E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1281/ 128728 | consumed samples: 20496 | consumed tokens: 41975808 | elapsed time per iteration (s): 15.18 | learning rate: 6.716E-06 | global batch size: 16 | lm loss: 7.014742E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1282/ 128728 | consumed samples: 20512 | consumed tokens: 42008576 | elapsed time per iteration (s): 15.24 | learning rate: 6.721E-06 | global batch size: 16 | lm loss: 7.076769E+00 | grad norm: 1.098 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1283/ 128728 | consumed samples: 20528 | consumed tokens: 42041344 | elapsed time per iteration (s): 15.26 | learning rate: 6.727E-06 | global batch size: 16 | lm loss: 7.277905E+00 | grad norm: 1.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1284/ 128728 | consumed samples: 20544 | consumed tokens: 42074112 | elapsed time per iteration (s): 15.24 | learning rate: 6.732E-06 | global batch size: 16 | lm loss: 7.167206E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1285/ 128728 | consumed samples: 20560 | consumed tokens: 42106880 | elapsed time per iteration (s): 15.26 | learning rate: 6.737E-06 | global batch size: 16 | lm loss: 6.924407E+00 | grad norm: 3.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1286/ 128728 | consumed samples: 20576 | consumed tokens: 42139648 | elapsed time per iteration (s): 15.25 | learning rate: 6.742E-06 | global batch size: 16 | lm loss: 7.007799E+00 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1287/ 128728 | consumed samples: 20592 | consumed tokens: 42172416 | elapsed time per iteration (s): 15.25 | learning rate: 6.748E-06 | global batch size: 16 | lm loss: 
6.977216E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1288/ 128728 | consumed samples: 20608 | consumed tokens: 42205184 | elapsed time per iteration (s): 15.20 | learning rate: 6.753E-06 | global batch size: 16 | lm loss: 6.887696E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1289/ 128728 | consumed samples: 20624 | consumed tokens: 42237952 | elapsed time per iteration (s): 15.24 | learning rate: 6.758E-06 | global batch size: 16 | lm loss: 7.158238E+00 | grad norm: 1.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1290/ 128728 | consumed samples: 20640 | consumed tokens: 42270720 | elapsed time per iteration (s): 15.26 | learning rate: 6.763E-06 | global batch size: 16 | lm loss: 7.162902E+00 | grad norm: 1.583 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1291/ 128728 | consumed samples: 20656 | consumed tokens: 42303488 | elapsed time per iteration (s): 15.21 | learning rate: 6.769E-06 | global batch size: 16 | lm loss: 7.018879E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1292/ 128728 | consumed samples: 20672 | consumed tokens: 42336256 | elapsed time per iteration (s): 15.23 | learning rate: 6.774E-06 | global batch size: 16 | lm loss: 6.894781E+00 | grad norm: 1.090 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1293/ 128728 | consumed samples: 20688 | consumed tokens: 42369024 | 
elapsed time per iteration (s): 15.25 | learning rate: 6.779E-06 | global batch size: 16 | lm loss: 6.989740E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1294/ 128728 | consumed samples: 20704 | consumed tokens: 42401792 | elapsed time per iteration (s): 15.22 | learning rate: 6.784E-06 | global batch size: 16 | lm loss: 7.075770E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1295/ 128728 | consumed samples: 20720 | consumed tokens: 42434560 | elapsed time per iteration (s): 15.22 | learning rate: 6.790E-06 | global batch size: 16 | lm loss: 7.155486E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1296/ 128728 | consumed samples: 20736 | consumed tokens: 42467328 | elapsed time per iteration (s): 15.22 | learning rate: 6.795E-06 | global batch size: 16 | lm loss: 7.141552E+00 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1297/ 128728 | consumed samples: 20752 | consumed tokens: 42500096 | elapsed time per iteration (s): 15.24 | learning rate: 6.800E-06 | global batch size: 16 | lm loss: 7.286495E+00 | grad norm: 1.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1298/ 128728 | consumed samples: 20768 | consumed tokens: 42532864 | elapsed time per iteration (s): 15.23 | learning rate: 6.805E-06 | global batch size: 16 | lm loss: 7.142656E+00 | grad norm: 1.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 1299/ 128728 | consumed samples: 20784 | consumed tokens: 42565632 | elapsed time per iteration (s): 15.23 | learning rate: 6.811E-06 | global batch size: 16 | lm loss: 6.876920E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1300/ 128728 | consumed samples: 20800 | consumed tokens: 42598400 | elapsed time per iteration (s): 15.22 | learning rate: 6.816E-06 | global batch size: 16 | lm loss: 6.969202E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1301/ 128728 | consumed samples: 20816 | consumed tokens: 42631168 | elapsed time per iteration (s): 15.25 | learning rate: 6.821E-06 | global batch size: 16 | lm loss: 7.109032E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1302/ 128728 | consumed samples: 20832 | consumed tokens: 42663936 | elapsed time per iteration (s): 15.22 | learning rate: 6.826E-06 | global batch size: 16 | lm loss: 6.858071E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1303/ 128728 | consumed samples: 20848 | consumed tokens: 42696704 | elapsed time per iteration (s): 15.20 | learning rate: 6.831E-06 | global batch size: 16 | lm loss: 6.878172E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1304/ 128728 | consumed samples: 20864 | consumed tokens: 42729472 | elapsed time per iteration (s): 15.25 | learning rate: 6.837E-06 | global batch size: 16 | lm loss: 6.795415E+00 | grad norm: 0.981 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1305/ 128728 | consumed samples: 20880 | consumed tokens: 42762240 | elapsed time per iteration (s): 15.25 | learning rate: 6.842E-06 | global batch size: 16 | lm loss: 7.055003E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1306/ 128728 | consumed samples: 20896 | consumed tokens: 42795008 | elapsed time per iteration (s): 15.23 | learning rate: 6.847E-06 | global batch size: 16 | lm loss: 7.091806E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1307/ 128728 | consumed samples: 20912 | consumed tokens: 42827776 | elapsed time per iteration (s): 15.22 | learning rate: 6.852E-06 | global batch size: 16 | lm loss: 7.148190E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1308/ 128728 | consumed samples: 20928 | consumed tokens: 42860544 | elapsed time per iteration (s): 15.22 | learning rate: 6.858E-06 | global batch size: 16 | lm loss: 7.025421E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1309/ 128728 | consumed samples: 20944 | consumed tokens: 42893312 | elapsed time per iteration (s): 15.23 | learning rate: 6.863E-06 | global batch size: 16 | lm loss: 6.918103E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1310/ 128728 | consumed samples: 20960 | consumed tokens: 42926080 | elapsed time per iteration (s): 15.25 | learning rate: 
6.868E-06 | global batch size: 16 | lm loss: 7.181193E+00 | grad norm: 1.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1311/ 128728 | consumed samples: 20976 | consumed tokens: 42958848 | elapsed time per iteration (s): 15.23 | learning rate: 6.873E-06 | global batch size: 16 | lm loss: 6.954082E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1312/ 128728 | consumed samples: 20992 | consumed tokens: 42991616 | elapsed time per iteration (s): 15.22 | learning rate: 6.879E-06 | global batch size: 16 | lm loss: 7.259136E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1313/ 128728 | consumed samples: 21008 | consumed tokens: 43024384 | elapsed time per iteration (s): 15.25 | learning rate: 6.884E-06 | global batch size: 16 | lm loss: 7.125967E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1314/ 128728 | consumed samples: 21024 | consumed tokens: 43057152 | elapsed time per iteration (s): 15.22 | learning rate: 6.889E-06 | global batch size: 16 | lm loss: 6.829364E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1315/ 128728 | consumed samples: 21040 | consumed tokens: 43089920 | elapsed time per iteration (s): 15.26 | learning rate: 6.894E-06 | global batch size: 16 | lm loss: 6.958238E+00 | grad norm: 1.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1316/ 128728 | consumed 
samples: 21056 | consumed tokens: 43122688 | elapsed time per iteration (s): 15.24 | learning rate: 6.900E-06 | global batch size: 16 | lm loss: 7.172208E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1317/ 128728 | consumed samples: 21072 | consumed tokens: 43155456 | elapsed time per iteration (s): 15.22 | learning rate: 6.905E-06 | global batch size: 16 | lm loss: 7.113717E+00 | grad norm: 0.922 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1318/ 128728 | consumed samples: 21088 | consumed tokens: 43188224 | elapsed time per iteration (s): 15.23 | learning rate: 6.910E-06 | global batch size: 16 | lm loss: 6.939925E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1319/ 128728 | consumed samples: 21104 | consumed tokens: 43220992 | elapsed time per iteration (s): 15.21 | learning rate: 6.915E-06 | global batch size: 16 | lm loss: 7.170483E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1320/ 128728 | consumed samples: 21120 | consumed tokens: 43253760 | elapsed time per iteration (s): 15.21 | learning rate: 6.921E-06 | global batch size: 16 | lm loss: 6.819268E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1321/ 128728 | consumed samples: 21136 | consumed tokens: 43286528 | elapsed time per iteration (s): 15.23 | learning rate: 6.926E-06 | global batch size: 16 | lm loss: 6.952059E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1322/ 128728 | consumed samples: 21152 | consumed tokens: 43319296 | elapsed time per iteration (s): 15.20 | learning rate: 6.931E-06 | global batch size: 16 | lm loss: 7.053830E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1323/ 128728 | consumed samples: 21168 | consumed tokens: 43352064 | elapsed time per iteration (s): 15.20 | learning rate: 6.936E-06 | global batch size: 16 | lm loss: 6.946447E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1324/ 128728 | consumed samples: 21184 | consumed tokens: 43384832 | elapsed time per iteration (s): 15.23 | learning rate: 6.942E-06 | global batch size: 16 | lm loss: 7.168796E+00 | grad norm: 1.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1325/ 128728 | consumed samples: 21200 | consumed tokens: 43417600 | elapsed time per iteration (s): 15.21 | learning rate: 6.947E-06 | global batch size: 16 | lm loss: 6.921078E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1326/ 128728 | consumed samples: 21216 | consumed tokens: 43450368 | elapsed time per iteration (s): 15.24 | learning rate: 6.952E-06 | global batch size: 16 | lm loss: 7.017085E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1327/ 128728 | consumed samples: 21232 | consumed tokens: 43483136 | elapsed time per iteration (s): 15.20 | learning rate: 6.957E-06 | global batch size: 16 | lm loss: 
7.202669E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1328/ 128728 | consumed samples: 21248 | consumed tokens: 43515904 | elapsed time per iteration (s): 15.29 | learning rate: 6.963E-06 | global batch size: 16 | lm loss: 7.048963E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1329/ 128728 | consumed samples: 21264 | consumed tokens: 43548672 | elapsed time per iteration (s): 15.23 | learning rate: 6.968E-06 | global batch size: 16 | lm loss: 6.967897E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1330/ 128728 | consumed samples: 21280 | consumed tokens: 43581440 | elapsed time per iteration (s): 15.26 | learning rate: 6.973E-06 | global batch size: 16 | lm loss: 7.000623E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1331/ 128728 | consumed samples: 21296 | consumed tokens: 43614208 | elapsed time per iteration (s): 15.25 | learning rate: 6.978E-06 | global batch size: 16 | lm loss: 7.296478E+00 | grad norm: 1.068 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1332/ 128728 | consumed samples: 21312 | consumed tokens: 43646976 | elapsed time per iteration (s): 15.23 | learning rate: 6.984E-06 | global batch size: 16 | lm loss: 7.027101E+00 | grad norm: 1.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1333/ 128728 | consumed samples: 21328 | consumed tokens: 43679744 | 
elapsed time per iteration (s): 15.21 | learning rate: 6.989E-06 | global batch size: 16 | lm loss: 7.019495E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1334/ 128728 | consumed samples: 21344 | consumed tokens: 43712512 | elapsed time per iteration (s): 15.23 | learning rate: 6.994E-06 | global batch size: 16 | lm loss: 6.855921E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1335/ 128728 | consumed samples: 21360 | consumed tokens: 43745280 | elapsed time per iteration (s): 15.24 | learning rate: 6.999E-06 | global batch size: 16 | lm loss: 7.009941E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1336/ 128728 | consumed samples: 21376 | consumed tokens: 43778048 | elapsed time per iteration (s): 15.26 | learning rate: 7.005E-06 | global batch size: 16 | lm loss: 6.913696E+00 | grad norm: 1.115 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1337/ 128728 | consumed samples: 21392 | consumed tokens: 43810816 | elapsed time per iteration (s): 15.23 | learning rate: 7.010E-06 | global batch size: 16 | lm loss: 7.002808E+00 | grad norm: 1.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1338/ 128728 | consumed samples: 21408 | consumed tokens: 43843584 | elapsed time per iteration (s): 15.26 | learning rate: 7.015E-06 | global batch size: 16 | lm loss: 7.006137E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 
8.03 | [default7]: iteration 1339/ 128728 | consumed samples: 21424 | consumed tokens: 43876352 | elapsed time per iteration (s): 15.26 | learning rate: 7.020E-06 | global batch size: 16 | lm loss: 6.981978E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1340/ 128728 | consumed samples: 21440 | consumed tokens: 43909120 | elapsed time per iteration (s): 15.25 | learning rate: 7.025E-06 | global batch size: 16 | lm loss: 7.002084E+00 | grad norm: 8.999 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1341/ 128728 | consumed samples: 21456 | consumed tokens: 43941888 | elapsed time per iteration (s): 15.24 | learning rate: 7.031E-06 | global batch size: 16 | lm loss: 7.157794E+00 | grad norm: 1.076 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1342/ 128728 | consumed samples: 21472 | consumed tokens: 43974656 | elapsed time per iteration (s): 15.23 | learning rate: 7.036E-06 | global batch size: 16 | lm loss: 6.872018E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1343/ 128728 | consumed samples: 21488 | consumed tokens: 44007424 | elapsed time per iteration (s): 15.21 | learning rate: 7.041E-06 | global batch size: 16 | lm loss: 6.791720E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1344/ 128728 | consumed samples: 21504 | consumed tokens: 44040192 | elapsed time per iteration (s): 15.25 | learning rate: 7.046E-06 | global batch size: 16 | lm loss: 6.878177E+00 | grad norm: 0.832 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1345/ 128728 | consumed samples: 21520 | consumed tokens: 44072960 | elapsed time per iteration (s): 15.24 | learning rate: 7.052E-06 | global batch size: 16 | lm loss: 6.884387E+00 | grad norm: 0.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1346/ 128728 | consumed samples: 21536 | consumed tokens: 44105728 | elapsed time per iteration (s): 15.24 | learning rate: 7.057E-06 | global batch size: 16 | lm loss: 6.997211E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1347/ 128728 | consumed samples: 21552 | consumed tokens: 44138496 | elapsed time per iteration (s): 15.26 | learning rate: 7.062E-06 | global batch size: 16 | lm loss: 7.032449E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1348/ 128728 | consumed samples: 21568 | consumed tokens: 44171264 | elapsed time per iteration (s): 15.23 | learning rate: 7.067E-06 | global batch size: 16 | lm loss: 7.008165E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1349/ 128728 | consumed samples: 21584 | consumed tokens: 44204032 | elapsed time per iteration (s): 15.21 | learning rate: 7.073E-06 | global batch size: 16 | lm loss: 7.024583E+00 | grad norm: 0.965 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1350/ 128728 | consumed samples: 21600 | consumed tokens: 44236800 | elapsed time per iteration (s): 15.27 | learning rate: 
7.078E-06 | global batch size: 16 | lm loss: 6.845006E+00 | grad norm: 2.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1351/ 128728 | consumed samples: 21616 | consumed tokens: 44269568 | elapsed time per iteration (s): 15.24 | learning rate: 7.083E-06 | global batch size: 16 | lm loss: 6.779938E+00 | grad norm: 1.443 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1352/ 128728 | consumed samples: 21632 | consumed tokens: 44302336 | elapsed time per iteration (s): 15.26 | learning rate: 7.088E-06 | global batch size: 16 | lm loss: 6.868844E+00 | grad norm: 1.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1353/ 128728 | consumed samples: 21648 | consumed tokens: 44335104 | elapsed time per iteration (s): 15.26 | learning rate: 7.094E-06 | global batch size: 16 | lm loss: 7.071971E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1354/ 128728 | consumed samples: 21664 | consumed tokens: 44367872 | elapsed time per iteration (s): 15.24 | learning rate: 7.099E-06 | global batch size: 16 | lm loss: 6.839797E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1355/ 128728 | consumed samples: 21680 | consumed tokens: 44400640 | elapsed time per iteration (s): 15.20 | learning rate: 7.104E-06 | global batch size: 16 | lm loss: 6.854629E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1356/ 128728 | consumed 
samples: 21696 | consumed tokens: 44433408 | elapsed time per iteration (s): 15.28 | learning rate: 7.109E-06 | global batch size: 16 | lm loss: 6.967502E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1357/ 128728 | consumed samples: 21712 | consumed tokens: 44466176 | elapsed time per iteration (s): 15.24 | learning rate: 7.115E-06 | global batch size: 16 | lm loss: 7.005933E+00 | grad norm: 1.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1358/ 128728 | consumed samples: 21728 | consumed tokens: 44498944 | elapsed time per iteration (s): 15.26 | learning rate: 7.120E-06 | global batch size: 16 | lm loss: 7.089840E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1359/ 128728 | consumed samples: 21744 | consumed tokens: 44531712 | elapsed time per iteration (s): 15.24 | learning rate: 7.125E-06 | global batch size: 16 | lm loss: 6.807289E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1360/ 128728 | consumed samples: 21760 | consumed tokens: 44564480 | elapsed time per iteration (s): 15.27 | learning rate: 7.130E-06 | global batch size: 16 | lm loss: 6.980482E+00 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1361/ 128728 | consumed samples: 21776 | consumed tokens: 44597248 | elapsed time per iteration (s): 15.21 | learning rate: 7.136E-06 | global batch size: 16 | lm loss: 6.865876E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1362/ 128728 | consumed samples: 21792 | consumed tokens: 44630016 | elapsed time per iteration (s): 15.25 | learning rate: 7.141E-06 | global batch size: 16 | lm loss: 6.621922E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1363/ 128728 | consumed samples: 21808 | consumed tokens: 44662784 | elapsed time per iteration (s): 15.26 | learning rate: 7.146E-06 | global batch size: 16 | lm loss: 6.988260E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1364/ 128728 | consumed samples: 21824 | consumed tokens: 44695552 | elapsed time per iteration (s): 15.25 | learning rate: 7.151E-06 | global batch size: 16 | lm loss: 7.108578E+00 | grad norm: 1.353 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1365/ 128728 | consumed samples: 21840 | consumed tokens: 44728320 | elapsed time per iteration (s): 15.25 | learning rate: 7.157E-06 | global batch size: 16 | lm loss: 6.960870E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1366/ 128728 | consumed samples: 21856 | consumed tokens: 44761088 | elapsed time per iteration (s): 15.25 | learning rate: 7.162E-06 | global batch size: 16 | lm loss: 7.074971E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1367/ 128728 | consumed samples: 21872 | consumed tokens: 44793856 | elapsed time per iteration (s): 15.25 | learning rate: 7.167E-06 | global batch size: 16 | lm loss: 
6.846851E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1368/ 128728 | consumed samples: 21888 | consumed tokens: 44826624 | elapsed time per iteration (s): 15.21 | learning rate: 7.172E-06 | global batch size: 16 | lm loss: 7.031826E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1369/ 128728 | consumed samples: 21904 | consumed tokens: 44859392 | elapsed time per iteration (s): 15.26 | learning rate: 7.178E-06 | global batch size: 16 | lm loss: 6.957930E+00 | grad norm: 1.013 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1370/ 128728 | consumed samples: 21920 | consumed tokens: 44892160 | elapsed time per iteration (s): 15.23 | learning rate: 7.183E-06 | global batch size: 16 | lm loss: 6.889624E+00 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1371/ 128728 | consumed samples: 21936 | consumed tokens: 44924928 | elapsed time per iteration (s): 15.22 | learning rate: 7.188E-06 | global batch size: 16 | lm loss: 6.951301E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1372/ 128728 | consumed samples: 21952 | consumed tokens: 44957696 | elapsed time per iteration (s): 15.25 | learning rate: 7.193E-06 | global batch size: 16 | lm loss: 7.280240E+00 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1373/ 128728 | consumed samples: 21968 | consumed tokens: 44990464 | 
elapsed time per iteration (s): 15.15 | learning rate: 7.198E-06 | global batch size: 16 | lm loss: 7.068165E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 1374/ 128728 | consumed samples: 21984 | consumed tokens: 45023232 | elapsed time per iteration (s): 15.20 | learning rate: 7.204E-06 | global batch size: 16 | lm loss: 6.842229E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1375/ 128728 | consumed samples: 22000 | consumed tokens: 45056000 | elapsed time per iteration (s): 15.19 | learning rate: 7.209E-06 | global batch size: 16 | lm loss: 6.986506E+00 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 1376/ 128728 | consumed samples: 22016 | consumed tokens: 45088768 | elapsed time per iteration (s): 15.20 | learning rate: 7.214E-06 | global batch size: 16 | lm loss: 6.987074E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1377/ 128728 | consumed samples: 22032 | consumed tokens: 45121536 | elapsed time per iteration (s): 15.23 | learning rate: 7.219E-06 | global batch size: 16 | lm loss: 6.934793E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1378/ 128728 | consumed samples: 22048 | consumed tokens: 45154304 | elapsed time per iteration (s): 15.25 | learning rate: 7.225E-06 | global batch size: 16 | lm loss: 7.082214E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 
8.03 | [default7]: iteration 1379/ 128728 | consumed samples: 22064 | consumed tokens: 45187072 | elapsed time per iteration (s): 15.22 | learning rate: 7.230E-06 | global batch size: 16 | lm loss: 6.853665E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1380/ 128728 | consumed samples: 22080 | consumed tokens: 45219840 | elapsed time per iteration (s): 15.20 | learning rate: 7.235E-06 | global batch size: 16 | lm loss: 7.111278E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1381/ 128728 | consumed samples: 22096 | consumed tokens: 45252608 | elapsed time per iteration (s): 15.24 | learning rate: 7.240E-06 | global batch size: 16 | lm loss: 6.896193E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1382/ 128728 | consumed samples: 22112 | consumed tokens: 45285376 | elapsed time per iteration (s): 15.21 | learning rate: 7.246E-06 | global batch size: 16 | lm loss: 6.947161E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1383/ 128728 | consumed samples: 22128 | consumed tokens: 45318144 | elapsed time per iteration (s): 15.23 | learning rate: 7.251E-06 | global batch size: 16 | lm loss: 7.008558E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1384/ 128728 | consumed samples: 22144 | consumed tokens: 45350912 | elapsed time per iteration (s): 15.22 | learning rate: 7.256E-06 | global batch size: 16 | lm loss: 6.837069E+00 | grad norm: 0.858 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1385/ 128728 | consumed samples: 22160 | consumed tokens: 45383680 | elapsed time per iteration (s): 15.23 | learning rate: 7.261E-06 | global batch size: 16 | lm loss: 6.870586E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1386/ 128728 | consumed samples: 22176 | consumed tokens: 45416448 | elapsed time per iteration (s): 15.25 | learning rate: 7.267E-06 | global batch size: 16 | lm loss: 6.940043E+00 | grad norm: 1.014 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1387/ 128728 | consumed samples: 22192 | consumed tokens: 45449216 | elapsed time per iteration (s): 15.24 | learning rate: 7.272E-06 | global batch size: 16 | lm loss: 6.792444E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1388/ 128728 | consumed samples: 22208 | consumed tokens: 45481984 | elapsed time per iteration (s): 15.22 | learning rate: 7.277E-06 | global batch size: 16 | lm loss: 6.868528E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1389/ 128728 | consumed samples: 22224 | consumed tokens: 45514752 | elapsed time per iteration (s): 15.23 | learning rate: 7.282E-06 | global batch size: 16 | lm loss: 6.799677E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1390/ 128728 | consumed samples: 22240 | consumed tokens: 45547520 | elapsed time per iteration (s): 15.24 | learning rate: 
7.288E-06 | global batch size: 16 | lm loss: 7.063715E+00 | grad norm: 1.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1391/ 128728 | consumed samples: 22256 | consumed tokens: 45580288 | elapsed time per iteration (s): 15.24 | learning rate: 7.293E-06 | global batch size: 16 | lm loss: 7.192670E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1392/ 128728 | consumed samples: 22272 | consumed tokens: 45613056 | elapsed time per iteration (s): 15.24 | learning rate: 7.298E-06 | global batch size: 16 | lm loss: 6.923073E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1393/ 128728 | consumed samples: 22288 | consumed tokens: 45645824 | elapsed time per iteration (s): 15.20 | learning rate: 7.303E-06 | global batch size: 16 | lm loss: 7.126211E+00 | grad norm: 0.952 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1394/ 128728 | consumed samples: 22304 | consumed tokens: 45678592 | elapsed time per iteration (s): 15.22 | learning rate: 7.309E-06 | global batch size: 16 | lm loss: 6.779180E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1395/ 128728 | consumed samples: 22320 | consumed tokens: 45711360 | elapsed time per iteration (s): 15.20 | learning rate: 7.314E-06 | global batch size: 16 | lm loss: 6.857265E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1396/ 128728 | consumed 
samples: 22336 | consumed tokens: 45744128 | elapsed time per iteration (s): 15.22 | learning rate: 7.319E-06 | global batch size: 16 | lm loss: 7.133854E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1397/ 128728 | consumed samples: 22352 | consumed tokens: 45776896 | elapsed time per iteration (s): 15.18 | learning rate: 7.324E-06 | global batch size: 16 | lm loss: 6.703786E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1398/ 128728 | consumed samples: 22368 | consumed tokens: 45809664 | elapsed time per iteration (s): 15.24 | learning rate: 7.330E-06 | global batch size: 16 | lm loss: 6.965015E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1399/ 128728 | consumed samples: 22384 | consumed tokens: 45842432 | elapsed time per iteration (s): 15.26 | learning rate: 7.335E-06 | global batch size: 16 | lm loss: 7.162525E+00 | grad norm: 3.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1400/ 128728 | consumed samples: 22400 | consumed tokens: 45875200 | elapsed time per iteration (s): 15.23 | learning rate: 7.340E-06 | global batch size: 16 | lm loss: 7.012984E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1401/ 128728 | consumed samples: 22416 | consumed tokens: 45907968 | elapsed time per iteration (s): 15.24 | learning rate: 7.345E-06 | global batch size: 16 | lm loss: 6.882190E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1402/ 128728 | consumed samples: 22432 | consumed tokens: 45940736 | elapsed time per iteration (s): 15.25 | learning rate: 7.351E-06 | global batch size: 16 | lm loss: 6.990128E+00 | grad norm: 1.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1403/ 128728 | consumed samples: 22448 | consumed tokens: 45973504 | elapsed time per iteration (s): 15.25 | learning rate: 7.356E-06 | global batch size: 16 | lm loss: 6.974439E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1404/ 128728 | consumed samples: 22464 | consumed tokens: 46006272 | elapsed time per iteration (s): 15.31 | learning rate: 7.361E-06 | global batch size: 16 | lm loss: 6.965978E+00 | grad norm: 1.030 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 1405/ 128728 | consumed samples: 22480 | consumed tokens: 46039040 | elapsed time per iteration (s): 15.24 | learning rate: 7.366E-06 | global batch size: 16 | lm loss: 6.979227E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1406/ 128728 | consumed samples: 22496 | consumed tokens: 46071808 | elapsed time per iteration (s): 15.23 | learning rate: 7.372E-06 | global batch size: 16 | lm loss: 6.995125E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1407/ 128728 | consumed samples: 22512 | consumed tokens: 46104576 | elapsed time per iteration (s): 15.29 | learning rate: 7.377E-06 | global batch size: 16 | lm loss: 
7.185478E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1408/ 128728 | consumed samples: 22528 | consumed tokens: 46137344 | elapsed time per iteration (s): 15.24 | learning rate: 7.382E-06 | global batch size: 16 | lm loss: 6.939486E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1409/ 128728 | consumed samples: 22544 | consumed tokens: 46170112 | elapsed time per iteration (s): 15.23 | learning rate: 7.387E-06 | global batch size: 16 | lm loss: 6.841136E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1410/ 128728 | consumed samples: 22560 | consumed tokens: 46202880 | elapsed time per iteration (s): 15.24 | learning rate: 7.392E-06 | global batch size: 16 | lm loss: 6.966883E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1411/ 128728 | consumed samples: 22576 | consumed tokens: 46235648 | elapsed time per iteration (s): 15.24 | learning rate: 7.398E-06 | global batch size: 16 | lm loss: 6.978424E+00 | grad norm: 1.322 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1412/ 128728 | consumed samples: 22592 | consumed tokens: 46268416 | elapsed time per iteration (s): 15.27 | learning rate: 7.403E-06 | global batch size: 16 | lm loss: 6.881705E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1413/ 128728 | consumed samples: 22608 | consumed tokens: 46301184 | 
elapsed time per iteration (s): 15.22 | learning rate: 7.408E-06 | global batch size: 16 | lm loss: 6.892154E+00 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1414/ 128728 | consumed samples: 22624 | consumed tokens: 46333952 | elapsed time per iteration (s): 15.23 | learning rate: 7.413E-06 | global batch size: 16 | lm loss: 6.848379E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1415/ 128728 | consumed samples: 22640 | consumed tokens: 46366720 | elapsed time per iteration (s): 15.22 | learning rate: 7.419E-06 | global batch size: 16 | lm loss: 6.779110E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1416/ 128728 | consumed samples: 22656 | consumed tokens: 46399488 | elapsed time per iteration (s): 15.21 | learning rate: 7.424E-06 | global batch size: 16 | lm loss: 7.056311E+00 | grad norm: 1.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1417/ 128728 | consumed samples: 22672 | consumed tokens: 46432256 | elapsed time per iteration (s): 15.23 | learning rate: 7.429E-06 | global batch size: 16 | lm loss: 6.982561E+00 | grad norm: 1.346 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1418/ 128728 | consumed samples: 22688 | consumed tokens: 46465024 | elapsed time per iteration (s): 15.22 | learning rate: 7.434E-06 | global batch size: 16 | lm loss: 6.817053E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1419/ 128728 | consumed samples: 22704 | consumed tokens: 46497792 | elapsed time per iteration (s): 15.21 | learning rate: 7.440E-06 | global batch size: 16 | lm loss: 6.851241E+00 | grad norm: 0.980 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1420/ 128728 | consumed samples: 22720 | consumed tokens: 46530560 | elapsed time per iteration (s): 15.21 | learning rate: 7.445E-06 | global batch size: 16 | lm loss: 7.001087E+00 | grad norm: 1.378 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1421/ 128728 | consumed samples: 22736 | consumed tokens: 46563328 | elapsed time per iteration (s): 15.25 | learning rate: 7.450E-06 | global batch size: 16 | lm loss: 6.835620E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1422/ 128728 | consumed samples: 22752 | consumed tokens: 46596096 | elapsed time per iteration (s): 15.26 | learning rate: 7.455E-06 | global batch size: 16 | lm loss: 7.090675E+00 | grad norm: 1.311 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1423/ 128728 | consumed samples: 22768 | consumed tokens: 46628864 | elapsed time per iteration (s): 15.26 | learning rate: 7.461E-06 | global batch size: 16 | lm loss: 6.860411E+00 | grad norm: 1.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1424/ 128728 | consumed samples: 22784 | consumed tokens: 46661632 | elapsed time per iteration (s): 15.29 | learning rate: 7.466E-06 | global batch size: 16 | lm loss: 6.869511E+00 | grad norm: 0.806 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 1425/ 128728 | consumed samples: 22800 | consumed tokens: 46694400 | elapsed time per iteration (s): 15.24 | learning rate: 7.471E-06 | global batch size: 16 | lm loss: 6.994910E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1426/ 128728 | consumed samples: 22816 | consumed tokens: 46727168 | elapsed time per iteration (s): 15.27 | learning rate: 7.476E-06 | global batch size: 16 | lm loss: 7.125865E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1427/ 128728 | consumed samples: 22832 | consumed tokens: 46759936 | elapsed time per iteration (s): 15.22 | learning rate: 7.482E-06 | global batch size: 16 | lm loss: 7.039857E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1428/ 128728 | consumed samples: 22848 | consumed tokens: 46792704 | elapsed time per iteration (s): 15.24 | learning rate: 7.487E-06 | global batch size: 16 | lm loss: 6.739178E+00 | grad norm: 1.490 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1429/ 128728 | consumed samples: 22864 | consumed tokens: 46825472 | elapsed time per iteration (s): 15.23 | learning rate: 7.492E-06 | global batch size: 16 | lm loss: 6.688814E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1430/ 128728 | consumed samples: 22880 | consumed tokens: 46858240 | elapsed time per iteration (s): 15.24 | learning rate: 
7.497E-06 | global batch size: 16 | lm loss: 6.988650E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1431/ 128728 | consumed samples: 22896 | consumed tokens: 46891008 | elapsed time per iteration (s): 15.20 | learning rate: 7.503E-06 | global batch size: 16 | lm loss: 7.054319E+00 | grad norm: 0.973 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1432/ 128728 | consumed samples: 22912 | consumed tokens: 46923776 | elapsed time per iteration (s): 15.24 | learning rate: 7.508E-06 | global batch size: 16 | lm loss: 6.973002E+00 | grad norm: 1.011 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1433/ 128728 | consumed samples: 22928 | consumed tokens: 46956544 | elapsed time per iteration (s): 15.31 | learning rate: 7.513E-06 | global batch size: 16 | lm loss: 6.700475E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 1434/ 128728 | consumed samples: 22944 | consumed tokens: 46989312 | elapsed time per iteration (s): 15.29 | learning rate: 7.518E-06 | global batch size: 16 | lm loss: 7.003654E+00 | grad norm: 1.084 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1435/ 128728 | consumed samples: 22960 | consumed tokens: 47022080 | elapsed time per iteration (s): 15.22 | learning rate: 7.524E-06 | global batch size: 16 | lm loss: 6.904319E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1436/ 128728 | consumed 
samples: 22976 | consumed tokens: 47054848 | elapsed time per iteration (s): 15.19 | learning rate: 7.529E-06 | global batch size: 16 | lm loss: 6.922503E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1437/ 128728 | consumed samples: 22992 | consumed tokens: 47087616 | elapsed time per iteration (s): 15.21 | learning rate: 7.534E-06 | global batch size: 16 | lm loss: 6.798236E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1438/ 128728 | consumed samples: 23008 | consumed tokens: 47120384 | elapsed time per iteration (s): 15.23 | learning rate: 7.539E-06 | global batch size: 16 | lm loss: 6.820006E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1439/ 128728 | consumed samples: 23024 | consumed tokens: 47153152 | elapsed time per iteration (s): 15.27 | learning rate: 7.545E-06 | global batch size: 16 | lm loss: 6.920378E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1440/ 128728 | consumed samples: 23040 | consumed tokens: 47185920 | elapsed time per iteration (s): 15.27 | learning rate: 7.550E-06 | global batch size: 16 | lm loss: 6.835717E+00 | grad norm: 1.298 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1441/ 128728 | consumed samples: 23056 | consumed tokens: 47218688 | elapsed time per iteration (s): 15.24 | learning rate: 7.555E-06 | global batch size: 16 | lm loss: 6.969578E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1442/ 128728 | consumed samples: 23072 | consumed tokens: 47251456 | elapsed time per iteration (s): 15.22 | learning rate: 7.560E-06 | global batch size: 16 | lm loss: 6.877041E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1443/ 128728 | consumed samples: 23088 | consumed tokens: 47284224 | elapsed time per iteration (s): 15.22 | learning rate: 7.565E-06 | global batch size: 16 | lm loss: 6.828847E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1444/ 128728 | consumed samples: 23104 | consumed tokens: 47316992 | elapsed time per iteration (s): 15.23 | learning rate: 7.571E-06 | global batch size: 16 | lm loss: 7.017298E+00 | grad norm: 1.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1445/ 128728 | consumed samples: 23120 | consumed tokens: 47349760 | elapsed time per iteration (s): 15.25 | learning rate: 7.576E-06 | global batch size: 16 | lm loss: 6.892804E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1446/ 128728 | consumed samples: 23136 | consumed tokens: 47382528 | elapsed time per iteration (s): 15.18 | learning rate: 7.581E-06 | global batch size: 16 | lm loss: 6.857821E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1447/ 128728 | consumed samples: 23152 | consumed tokens: 47415296 | elapsed time per iteration (s): 15.24 | learning rate: 7.586E-06 | global batch size: 16 | lm loss: 
6.927748E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1448/ 128728 | consumed samples: 23168 | consumed tokens: 47448064 | elapsed time per iteration (s): 15.25 | learning rate: 7.592E-06 | global batch size: 16 | lm loss: 6.929221E+00 | grad norm: 1.089 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1449/ 128728 | consumed samples: 23184 | consumed tokens: 47480832 | elapsed time per iteration (s): 15.26 | learning rate: 7.597E-06 | global batch size: 16 | lm loss: 6.774077E+00 | grad norm: 1.585 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1450/ 128728 | consumed samples: 23200 | consumed tokens: 47513600 | elapsed time per iteration (s): 15.27 | learning rate: 7.602E-06 | global batch size: 16 | lm loss: 6.842887E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1451/ 128728 | consumed samples: 23216 | consumed tokens: 47546368 | elapsed time per iteration (s): 15.23 | learning rate: 7.607E-06 | global batch size: 16 | lm loss: 6.983165E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1452/ 128728 | consumed samples: 23232 | consumed tokens: 47579136 | elapsed time per iteration (s): 15.21 | learning rate: 7.613E-06 | global batch size: 16 | lm loss: 6.892272E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1453/ 128728 | consumed samples: 23248 | consumed tokens: 47611904 | 
elapsed time per iteration (s): 15.23 | learning rate: 7.618E-06 | global batch size: 16 | lm loss: 6.959459E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1454/ 128728 | consumed samples: 23264 | consumed tokens: 47644672 | elapsed time per iteration (s): 15.23 | learning rate: 7.623E-06 | global batch size: 16 | lm loss: 6.613215E+00 | grad norm: 1.072 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1455/ 128728 | consumed samples: 23280 | consumed tokens: 47677440 | elapsed time per iteration (s): 15.24 | learning rate: 7.628E-06 | global batch size: 16 | lm loss: 6.947182E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1456/ 128728 | consumed samples: 23296 | consumed tokens: 47710208 | elapsed time per iteration (s): 15.24 | learning rate: 7.634E-06 | global batch size: 16 | lm loss: 6.893425E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1457/ 128728 | consumed samples: 23312 | consumed tokens: 47742976 | elapsed time per iteration (s): 15.26 | learning rate: 7.639E-06 | global batch size: 16 | lm loss: 6.631948E+00 | grad norm: 0.894 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1458/ 128728 | consumed samples: 23328 | consumed tokens: 47775744 | elapsed time per iteration (s): 15.22 | learning rate: 7.644E-06 | global batch size: 16 | lm loss: 7.102271E+00 | grad norm: 0.988 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1459/ 128728 | consumed samples: 23344 | consumed tokens: 47808512 | elapsed time per iteration (s): 15.21 | learning rate: 7.649E-06 | global batch size: 16 | lm loss: 6.629117E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1460/ 128728 | consumed samples: 23360 | consumed tokens: 47841280 | elapsed time per iteration (s): 15.25 | learning rate: 7.655E-06 | global batch size: 16 | lm loss: 6.952769E+00 | grad norm: 1.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1461/ 128728 | consumed samples: 23376 | consumed tokens: 47874048 | elapsed time per iteration (s): 15.26 | learning rate: 7.660E-06 | global batch size: 16 | lm loss: 6.996358E+00 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1462/ 128728 | consumed samples: 23392 | consumed tokens: 47906816 | elapsed time per iteration (s): 15.24 | learning rate: 7.665E-06 | global batch size: 16 | lm loss: 6.833821E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1463/ 128728 | consumed samples: 23408 | consumed tokens: 47939584 | elapsed time per iteration (s): 15.23 | learning rate: 7.670E-06 | global batch size: 16 | lm loss: 6.710407E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1464/ 128728 | consumed samples: 23424 | consumed tokens: 47972352 | elapsed time per iteration (s): 15.16 | learning rate: 7.676E-06 | global batch size: 16 | lm loss: 6.818951E+00 | grad norm: 0.767 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 1465/ 128728 | consumed samples: 23440 | consumed tokens: 48005120 | elapsed time per iteration (s): 15.21 | learning rate: 7.681E-06 | global batch size: 16 | lm loss: 6.974868E+00 | grad norm: 0.856 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1466/ 128728 | consumed samples: 23456 | consumed tokens: 48037888 | elapsed time per iteration (s): 15.17 | learning rate: 7.686E-06 | global batch size: 16 | lm loss: 6.911908E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1467/ 128728 | consumed samples: 23472 | consumed tokens: 48070656 | elapsed time per iteration (s): 15.21 | learning rate: 7.691E-06 | global batch size: 16 | lm loss: 6.894742E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1468/ 128728 | consumed samples: 23488 | consumed tokens: 48103424 | elapsed time per iteration (s): 15.23 | learning rate: 7.697E-06 | global batch size: 16 | lm loss: 6.738654E+00 | grad norm: 1.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1469/ 128728 | consumed samples: 23504 | consumed tokens: 48136192 | elapsed time per iteration (s): 15.20 | learning rate: 7.702E-06 | global batch size: 16 | lm loss: 6.781757E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1470/ 128728 | consumed samples: 23520 | consumed tokens: 48168960 | elapsed time per iteration (s): 15.22 | learning rate: 
7.707E-06 | global batch size: 16 | lm loss: 6.828523E+00 | grad norm: 1.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1471/ 128728 | consumed samples: 23536 | consumed tokens: 48201728 | elapsed time per iteration (s): 15.25 | learning rate: 7.712E-06 | global batch size: 16 | lm loss: 6.891495E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1472/ 128728 | consumed samples: 23552 | consumed tokens: 48234496 | elapsed time per iteration (s): 15.21 | learning rate: 7.718E-06 | global batch size: 16 | lm loss: 6.899791E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1473/ 128728 | consumed samples: 23568 | consumed tokens: 48267264 | elapsed time per iteration (s): 15.21 | learning rate: 7.723E-06 | global batch size: 16 | lm loss: 6.920649E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1474/ 128728 | consumed samples: 23584 | consumed tokens: 48300032 | elapsed time per iteration (s): 15.19 | learning rate: 7.728E-06 | global batch size: 16 | lm loss: 6.843232E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1475/ 128728 | consumed samples: 23600 | consumed tokens: 48332800 | elapsed time per iteration (s): 15.21 | learning rate: 7.733E-06 | global batch size: 16 | lm loss: 6.969937E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1476/ 128728 | consumed 
samples: 23616 | consumed tokens: 48365568 | elapsed time per iteration (s): 15.20 | learning rate: 7.739E-06 | global batch size: 16 | lm loss: 6.757054E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1477/ 128728 | consumed samples: 23632 | consumed tokens: 48398336 | elapsed time per iteration (s): 15.21 | learning rate: 7.744E-06 | global batch size: 16 | lm loss: 6.830835E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1478/ 128728 | consumed samples: 23648 | consumed tokens: 48431104 | elapsed time per iteration (s): 15.24 | learning rate: 7.749E-06 | global batch size: 16 | lm loss: 6.957498E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1479/ 128728 | consumed samples: 23664 | consumed tokens: 48463872 | elapsed time per iteration (s): 15.24 | learning rate: 7.754E-06 | global batch size: 16 | lm loss: 6.825459E+00 | grad norm: 1.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1480/ 128728 | consumed samples: 23680 | consumed tokens: 48496640 | elapsed time per iteration (s): 15.23 | learning rate: 7.759E-06 | global batch size: 16 | lm loss: 6.764349E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1481/ 128728 | consumed samples: 23696 | consumed tokens: 48529408 | elapsed time per iteration (s): 15.27 | learning rate: 7.765E-06 | global batch size: 16 | lm loss: 6.935419E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1482/ 128728 | consumed samples: 23712 | consumed tokens: 48562176 | elapsed time per iteration (s): 15.24 | learning rate: 7.770E-06 | global batch size: 16 | lm loss: 6.933623E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1483/ 128728 | consumed samples: 23728 | consumed tokens: 48594944 | elapsed time per iteration (s): 15.18 | learning rate: 7.775E-06 | global batch size: 16 | lm loss: 6.809566E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1484/ 128728 | consumed samples: 23744 | consumed tokens: 48627712 | elapsed time per iteration (s): 15.28 | learning rate: 7.780E-06 | global batch size: 16 | lm loss: 6.744482E+00 | grad norm: 1.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1485/ 128728 | consumed samples: 23760 | consumed tokens: 48660480 | elapsed time per iteration (s): 15.25 | learning rate: 7.786E-06 | global batch size: 16 | lm loss: 6.929039E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1486/ 128728 | consumed samples: 23776 | consumed tokens: 48693248 | elapsed time per iteration (s): 15.25 | learning rate: 7.791E-06 | global batch size: 16 | lm loss: 6.843914E+00 | grad norm: 1.031 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1487/ 128728 | consumed samples: 23792 | consumed tokens: 48726016 | elapsed time per iteration (s): 15.24 | learning rate: 7.796E-06 | global batch size: 16 | lm loss: 
7.174544E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1488/ 128728 | consumed samples: 23808 | consumed tokens: 48758784 | elapsed time per iteration (s): 15.24 | learning rate: 7.801E-06 | global batch size: 16 | lm loss: 6.827503E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1489/ 128728 | consumed samples: 23824 | consumed tokens: 48791552 | elapsed time per iteration (s): 15.25 | learning rate: 7.807E-06 | global batch size: 16 | lm loss: 6.747015E+00 | grad norm: 1.034 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1490/ 128728 | consumed samples: 23840 | consumed tokens: 48824320 | elapsed time per iteration (s): 15.20 | learning rate: 7.812E-06 | global batch size: 16 | lm loss: 6.738760E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1491/ 128728 | consumed samples: 23856 | consumed tokens: 48857088 | elapsed time per iteration (s): 15.25 | learning rate: 7.817E-06 | global batch size: 16 | lm loss: 6.907768E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1492/ 128728 | consumed samples: 23872 | consumed tokens: 48889856 | elapsed time per iteration (s): 15.25 | learning rate: 7.822E-06 | global batch size: 16 | lm loss: 6.860197E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1493/ 128728 | consumed samples: 23888 | consumed tokens: 48922624 | 
elapsed time per iteration (s): 15.23 | learning rate: 7.828E-06 | global batch size: 16 | lm loss: 6.858501E+00 | grad norm: 1.000 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1494/ 128728 | consumed samples: 23904 | consumed tokens: 48955392 | elapsed time per iteration (s): 15.21 | learning rate: 7.833E-06 | global batch size: 16 | lm loss: 6.810994E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1495/ 128728 | consumed samples: 23920 | consumed tokens: 48988160 | elapsed time per iteration (s): 15.24 | learning rate: 7.838E-06 | global batch size: 16 | lm loss: 6.897250E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1496/ 128728 | consumed samples: 23936 | consumed tokens: 49020928 | elapsed time per iteration (s): 15.20 | learning rate: 7.843E-06 | global batch size: 16 | lm loss: 7.080896E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1497/ 128728 | consumed samples: 23952 | consumed tokens: 49053696 | elapsed time per iteration (s): 15.26 | learning rate: 7.849E-06 | global batch size: 16 | lm loss: 6.848498E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1498/ 128728 | consumed samples: 23968 | consumed tokens: 49086464 | elapsed time per iteration (s): 15.23 | learning rate: 7.854E-06 | global batch size: 16 | lm loss: 6.933249E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1499/ 128728 | consumed samples: 23984 | consumed tokens: 49119232 | elapsed time per iteration (s): 15.23 | learning rate: 7.859E-06 | global batch size: 16 | lm loss: 7.075923E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1500/ 128728 | consumed samples: 24000 | consumed tokens: 49152000 | elapsed time per iteration (s): 15.22 | learning rate: 7.864E-06 | global batch size: 16 | lm loss: 6.872234E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default0]:saving checkpoint at iteration 1500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default1]:[2022-03-03 12:16:52,729] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/mp_rank_01_model_states.pt [default0]:[2022-03-03 12:16:52,902] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/mp_rank_00_model_states.pt [default5]:[2022-03-03 12:17:14,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default3]:[2022-03-03 12:17:14,381] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default6]:[2022-03-03 12:17:14,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default6]:[2022-03-03 12:17:14,472] 
[INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default6]:[2022-03-03 12:17:14,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default4]:[2022-03-03 12:17:14,479] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default0]:[2022-03-03 12:17:14,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default3]:[2022-03-03 12:17:14,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default2]:[2022-03-03 12:17:14,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default1]:[2022-03-03 12:17:14,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default4]:[2022-03-03 12:17:14,730] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default5]:[2022-03-03 12:17:14,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default2]:[2022-03-03 12:17:14,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default4]:[2022-03-03 12:17:14,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default5]:[2022-03-03 12:17:14,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default2]:[2022-03-03 12:17:14,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default3]:[2022-03-03 12:17:14,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default4]:[2022-03-03 12:17:14,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default7]:[2022-03-03 12:17:14,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default2]:[2022-03-03 12:17:14,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default7]:[2022-03-03 12:17:14,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default6]:[2022-03-03 12:17:14,928] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default3]:[2022-03-03 12:17:14,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default7]:[2022-03-03 12:17:15,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default7]:[2022-03-03 12:17:15,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default4]:[2022-03-03 12:17:15,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default2]:[2022-03-03 12:17:15,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default7]:[2022-03-03 12:17:15,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default5]:[2022-03-03 12:17:15,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default1]:[2022-03-03 12:17:15,173] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default6]:[2022-03-03 12:17:15,156] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default0]:[2022-03-03 12:17:15,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default3]:[2022-03-03 12:17:15,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default0]:[2022-03-03 12:17:15,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default0]:[2022-03-03 12:17:15,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default5]:[2022-03-03 12:17:15,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default5]:[2022-03-03 12:17:15,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default1]:[2022-03-03 12:17:15,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default6]:[2022-03-03 12:17:15,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default0]:[2022-03-03 12:17:15,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default2]:[2022-03-03 12:17:15,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default1]:[2022-03-03 12:17:15,550] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default5]:[2022-03-03 12:17:15,612] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default1]:[2022-03-03 12:17:15,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default1]:[2022-03-03 12:17:15,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default4]:[2022-03-03 12:17:15,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default7]:[2022-03-03 12:17:15,895] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default3]:[2022-03-03 12:17:15,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default0]:[2022-03-03 12:17:16,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default2]:[2022-03-03 12:17:16,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default5]:[2022-03-03 12:17:16,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default0]:[2022-03-03 12:17:16,961] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default7]:[2022-03-03 12:17:16,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default6]:[2022-03-03 12:17:16,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default1]:[2022-03-03 12:17:17,062] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default0]:[2022-03-03 12:17:17,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default2]:[2022-03-03 12:17:17,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default1]:[2022-03-03 12:17:17,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default3]:[2022-03-03 12:17:17,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default3]:[2022-03-03 12:17:17,329] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default4]:[2022-03-03 12:17:17,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default5]:[2022-03-03 12:17:17,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default6]:[2022-03-03 12:17:17,562] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default3]:[2022-03-03 12:17:17,776] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default4]:[2022-03-03 12:17:17,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default0]:[2022-03-03 12:17:17,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default2]:[2022-03-03 12:17:17,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default0]:[2022-03-03 12:17:18,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default4]:[2022-03-03 12:17:18,006] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default5]:[2022-03-03 12:17:18,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default0]:[2022-03-03 12:17:17,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default1]:[2022-03-03 12:17:18,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default2]:[2022-03-03 12:17:18,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default1]:[2022-03-03 12:17:18,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default3]:[2022-03-03 12:17:18,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default7]:[2022-03-03 12:17:18,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default0]:[2022-03-03 12:17:18,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default6]:[2022-03-03 12:17:18,249] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default2]:[2022-03-03 12:17:18,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default7]:[2022-03-03 12:17:18,399] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default7]:[2022-03-03 12:17:18,519] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default6]:[2022-03-03 12:17:18,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default1]:[2022-03-03 12:17:18,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default1]:[2022-03-03 12:17:18,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default3]:[2022-03-03 12:17:18,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default1]:[2022-03-03 12:17:18,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default3]:[2022-03-03 12:17:18,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default0]:[2022-03-03 12:17:18,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default7]:[2022-03-03 12:17:18,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default6]:[2022-03-03 12:17:18,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default3]:[2022-03-03 12:17:18,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default1]:[2022-03-03 12:17:18,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default2]:[2022-03-03 12:17:19,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default0]:[2022-03-03 12:17:19,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default1]:[2022-03-03 12:17:19,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default4]:[2022-03-03 12:17:19,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default3]:[2022-03-03 12:17:19,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default0]:[2022-03-03 12:17:19,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default5]:[2022-03-03 12:17:19,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default2]:[2022-03-03 12:17:19,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default0]:[2022-03-03 12:17:19,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default4]:[2022-03-03 12:17:19,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default7]:[2022-03-03 12:17:19,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default4]:[2022-03-03 12:17:19,315] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default1]:[2022-03-03 12:17:19,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default7]:[2022-03-03 12:17:19,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default5]:[2022-03-03 12:17:19,386] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default4]:[2022-03-03 12:17:19,430] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default6]:[2022-03-03 12:17:19,342] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default5]:[2022-03-03 12:17:19,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default4]:[2022-03-03 12:17:19,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default1]:[2022-03-03 12:17:19,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default2]:[2022-03-03 12:17:19,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default6]:[2022-03-03 12:17:19,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default2]:[2022-03-03 12:17:19,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default3]:[2022-03-03 12:17:19,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default1]:[2022-03-03 12:17:19,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default2]:[2022-03-03 12:17:19,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default2]:[2022-03-03 12:17:19,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default1]:[2022-03-03 12:17:19,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default7]:[2022-03-03 12:17:19,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default0]:[2022-03-03 12:17:19,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default6]:[2022-03-03 12:17:19,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default7]:[2022-03-03 12:17:19,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default0]:[2022-03-03 12:17:19,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default6]:[2022-03-03 12:17:19,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default0]:[2022-03-03 12:17:20,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default3]:[2022-03-03 12:17:20,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default1]:[2022-03-03 12:17:20,011] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default3]:[2022-03-03 12:17:20,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default3]:[2022-03-03 12:17:20,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default4]:[2022-03-03 12:17:19,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default6]:[2022-03-03 12:17:20,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default5]:[2022-03-03 12:17:20,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default0]:[2022-03-03 12:17:20,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default3]:[2022-03-03 12:17:20,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default5]:[2022-03-03 12:17:20,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default4]:[2022-03-03 12:17:20,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default2]:[2022-03-03 12:17:20,194] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default5]:[2022-03-03 12:17:20,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default1]:[2022-03-03 12:17:20,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default4]:[2022-03-03 12:17:20,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default7]:[2022-03-03 12:17:20,260] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default1]:[2022-03-03 12:17:20,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default7]:[2022-03-03 12:17:20,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default0]:[2022-03-03 12:17:20,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default0]:[2022-03-03 12:17:20,425] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default6]:[2022-03-03 12:17:20,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default3]:[2022-03-03 12:17:20,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default6]:[2022-03-03 12:17:20,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default7]:[2022-03-03 12:17:20,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default4]:[2022-03-03 12:17:20,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default5]:[2022-03-03 12:17:20,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default5]:[2022-03-03 12:17:20,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default6]:[2022-03-03 12:17:20,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default5]:[2022-03-03 12:17:20,546] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default2]:[2022-03-03 12:17:20,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default4]:[2022-03-03 12:17:20,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default1]:[2022-03-03 12:17:20,520] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default5]:[2022-03-03 12:17:20,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default4]:[2022-03-03 12:17:20,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default4]:[2022-03-03 12:17:20,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default2]:[2022-03-03 12:17:20,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default1]:[2022-03-03 12:17:20,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default6]:[2022-03-03 12:17:20,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default2]:[2022-03-03 12:17:20,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default5]:[2022-03-03 12:17:20,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default2]:[2022-03-03 12:17:20,818] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default4]:[2022-03-03 12:17:20,847] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default3]:[2022-03-03 12:17:20,803] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default5]:[2022-03-03 12:17:20,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default7]:[2022-03-03 12:17:20,885] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default7]:[2022-03-03 12:17:20,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default7]:[2022-03-03 12:17:20,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default5]:[2022-03-03 12:17:20,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default1]:[2022-03-03 12:17:20,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default5]:[2022-03-03 12:17:20,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default1]:[2022-03-03 12:17:20,903] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default0]:[2022-03-03 12:17:20,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default6]:[2022-03-03 12:17:20,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default2]:[2022-03-03 12:17:21,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default3]:[2022-03-03 12:17:21,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default4]:[2022-03-03 12:17:21,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default3]:[2022-03-03 12:17:21,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default3]:[2022-03-03 12:17:21,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default4]:[2022-03-03 12:17:20,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default1]:[2022-03-03 12:17:21,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default0]:[2022-03-03 12:17:21,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default2]:[2022-03-03 12:17:21,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default3]:[2022-03-03 12:17:21,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default7]:[2022-03-03 12:17:21,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default0]:[2022-03-03 12:17:21,292] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default2]:[2022-03-03 12:17:21,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default2]:[2022-03-03 12:17:21,323] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default2]:[2022-03-03 12:17:21,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default5]:[2022-03-03 12:17:21,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default3]:[2022-03-03 12:17:21,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default0]:[2022-03-03 12:17:21,505] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default3]:[2022-03-03 12:17:21,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default1]:[2022-03-03 12:17:21,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default7]:[2022-03-03 12:17:21,463] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default4]:[2022-03-03 12:17:21,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default2]:[2022-03-03 12:17:21,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default6]:[2022-03-03 12:17:21,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default3]:[2022-03-03 12:17:21,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default6]:[2022-03-03 12:17:21,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default5]:[2022-03-03 12:17:21,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default2]:[2022-03-03 12:17:21,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default7]:[2022-03-03 12:17:21,765] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default3]:[2022-03-03 12:17:21,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default2]:[2022-03-03 12:17:21,779] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default3]:[2022-03-03 12:17:21,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default1]:[2022-03-03 12:17:21,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default0]:[2022-03-03 12:17:21,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default6]:[2022-03-03 12:17:21,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default3]:[2022-03-03 12:17:21,947] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default7]:[2022-03-03 12:17:22,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default6]:[2022-03-03 12:17:22,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default0]:[2022-03-03 12:17:22,095] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default3]:[2022-03-03 12:17:22,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default7]:[2022-03-03 12:17:22,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default5]:[2022-03-03 12:17:22,266] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default3]:[2022-03-03 12:17:22,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default2]:[2022-03-03 12:17:22,292] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default2]:[2022-03-03 12:17:22,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default5]:[2022-03-03 12:17:22,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default5]:[2022-03-03 12:17:22,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default4]:[2022-03-03 12:17:22,392] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default5]:[2022-03-03 12:17:22,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default4]:[2022-03-03 12:17:22,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default5]:[2022-03-03 12:17:22,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default3]:[2022-03-03 12:17:22,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default2]:[2022-03-03 12:17:22,427] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default5]:[2022-03-03 12:17:22,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default4]:[2022-03-03 12:17:22,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default4]:[2022-03-03 12:17:22,458] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default5]:[2022-03-03 12:17:22,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default2]:[2022-03-03 12:17:22,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default7]:[2022-03-03 12:17:22,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default1]:[2022-03-03 12:17:22,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default4]:[2022-03-03 12:17:22,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default0]:[2022-03-03 12:17:22,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default3]:[2022-03-03 12:17:22,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default3]:[2022-03-03 12:17:22,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default1]:[2022-03-03 12:17:22,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default0]:[2022-03-03 12:17:22,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default2]:[2022-03-03 12:17:23,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default6]:[2022-03-03 12:17:22,967] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default4]:[2022-03-03 12:17:22,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default2]:[2022-03-03 12:17:23,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default1]:[2022-03-03 12:17:23,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default0]:[2022-03-03 12:17:23,185] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default7]:[2022-03-03 12:17:23,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default0]:[2022-03-03 12:17:23,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default2]:[2022-03-03 12:17:23,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default0]:[2022-03-03 12:17:23,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default3]:[2022-03-03 12:17:23,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default4]:[2022-03-03 12:17:23,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default2]:[2022-03-03 12:17:23,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default0]:[2022-03-03 12:17:23,475] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default6]:[2022-03-03 12:17:23,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default1]:[2022-03-03 12:17:23,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default0]:[2022-03-03 12:17:23,565] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default7]:[2022-03-03 12:17:23,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default4]:[2022-03-03 12:17:23,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default1]:[2022-03-03 12:17:23,781] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default3]:[2022-03-03 12:17:23,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default7]:[2022-03-03 12:17:23,822] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default0]:[2022-03-03 12:17:23,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default6]:[2022-03-03 12:17:23,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default6]:[2022-03-03 12:17:23,884] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default4]:[2022-03-03 12:17:23,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default3]:[2022-03-03 12:17:23,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default6]:[2022-03-03 12:17:23,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default1]:[2022-03-03 12:17:23,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default7]:[2022-03-03 12:17:23,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default6]:[2022-03-03 12:17:23,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default7]:[2022-03-03 12:17:24,015] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default1]:[2022-03-03 12:17:24,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default3]:[2022-03-03 12:17:24,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default6]:[2022-03-03 12:17:24,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default2]:[2022-03-03 12:17:24,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default0]:[2022-03-03 12:17:24,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default4]:[2022-03-03 12:17:24,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default5]:[2022-03-03 12:17:24,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default4]:[2022-03-03 12:17:24,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default1]:[2022-03-03 12:17:24,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default2]:[2022-03-03 12:17:24,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default1]:[2022-03-03 12:17:24,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default0]:[2022-03-03 12:17:24,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default6]:[2022-03-03 12:17:24,676] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default1]:[2022-03-03 12:17:24,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default5]:[2022-03-03 12:17:24,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default7]:[2022-03-03 12:17:24,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default7]:[2022-03-03 12:17:24,712] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default6]:[2022-03-03 12:17:24,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default7]:[2022-03-03 12:17:24,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default2]:[2022-03-03 12:17:24,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default2]:[2022-03-03 12:17:24,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default6]:[2022-03-03 12:17:25,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default2]:[2022-03-03 12:17:25,102] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default6]:[2022-03-03 12:17:25,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default7]:[2022-03-03 12:17:25,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default3]:[2022-03-03 12:17:25,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default5]:[2022-03-03 12:17:25,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default4]:[2022-03-03 12:17:25,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default7]:[2022-03-03 12:17:25,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default6]:[2022-03-03 12:17:25,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default0]:[2022-03-03 12:17:26,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default7]:[2022-03-03 12:17:26,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default0]:[2022-03-03 12:17:26,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default1]:[2022-03-03 12:17:26,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default5]:[2022-03-03 12:17:26,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default1]:[2022-03-03 12:17:26,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default2]:[2022-03-03 12:17:26,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default4]:[2022-03-03 12:17:26,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default6]:[2022-03-03 12:17:26,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default7]:[2022-03-03 12:17:26,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default5]:[2022-03-03 12:17:26,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default6]:[2022-03-03 12:17:26,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default3]:[2022-03-03 12:17:26,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default1]:[2022-03-03 12:17:26,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default6]:[2022-03-03 12:17:26,807] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default3]:[2022-03-03 12:17:26,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default7]:[2022-03-03 12:17:26,833] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default2]:[2022-03-03 12:17:26,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default1]:[2022-03-03 12:17:26,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default3]:[2022-03-03 12:17:26,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default0]:[2022-03-03 12:17:26,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default4]:[2022-03-03 12:17:27,054] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default2]:[2022-03-03 12:17:27,220] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default4]:[2022-03-03 12:17:27,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default5]:[2022-03-03 12:17:27,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default7]:[2022-03-03 12:17:27,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default2]:[2022-03-03 12:17:27,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default6]:[2022-03-03 12:17:27,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default7]:[2022-03-03 12:17:27,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default6]:[2022-03-03 12:17:27,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default5]:[2022-03-03 12:17:27,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default0]:[2022-03-03 12:17:27,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default1]:[2022-03-03 12:17:27,758] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default0]:[2022-03-03 12:17:27,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default5]:[2022-03-03 12:17:27,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default3]:[2022-03-03 12:17:28,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default1]:[2022-03-03 12:17:28,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default4]:[2022-03-03 12:17:28,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default5]:[2022-03-03 12:17:28,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default5]:[2022-03-03 12:17:28,099] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default0]:[2022-03-03 12:17:28,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default6]:[2022-03-03 12:17:28,190] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default4]:[2022-03-03 12:17:28,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default3]:[2022-03-03 12:17:28,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default4]:[2022-03-03 12:17:28,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default7]:[2022-03-03 12:17:28,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default1]:[2022-03-03 12:17:28,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default3]:[2022-03-03 12:17:28,331] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default6]:[2022-03-03 12:17:28,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default7]:[2022-03-03 12:17:28,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default0]:[2022-03-03 12:17:28,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default1]:[2022-03-03 12:17:28,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default5]:[2022-03-03 12:17:29,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default4]:[2022-03-03 12:17:29,146] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default5]:[2022-03-03 12:17:29,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default4]:[2022-03-03 12:17:30,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default0]:[2022-03-03 12:17:30,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default7]:[2022-03-03 12:17:30,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default5]:[2022-03-03 12:17:30,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default0]:[2022-03-03 12:17:30,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default4]:[2022-03-03 12:17:30,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default6]:[2022-03-03 12:17:31,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default3]:[2022-03-03 12:17:31,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default2]:[2022-03-03 12:17:30,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default7]:[2022-03-03 12:17:31,421] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default5]:[2022-03-03 12:17:31,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default6]:[2022-03-03 12:17:31,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default4]:[2022-03-03 12:17:31,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default0]:[2022-03-03 12:17:31,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default6]:[2022-03-03 12:17:31,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default7]:[2022-03-03 12:17:32,091] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default1]:[2022-03-03 12:17:32,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default4]:[2022-03-03 12:17:33,895] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default5]:[2022-03-03 12:17:33,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default6]:[2022-03-03 12:17:35,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default7]:[2022-03-03 12:17:35,249] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step1500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default7]:time (ms) | save-checkpoint: 50223.83 [default0]: successfully saved checkpoint at iteration 1500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]: iteration 1501/ 128728 | consumed samples: 24016 | consumed tokens: 49184768 | elapsed time per iteration (s): 65.45 | learning rate: 7.870E-06 | global batch size: 16 | lm loss: 6.884842E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.244 | TFLOPs: 1.87 | [default7]: iteration 1502/ 128728 | consumed samples: 24032 | consumed tokens: 49217536 | elapsed time per iteration (s): 15.25 | learning rate: 7.875E-06 | global batch size: 16 | lm loss: 6.994694E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1503/ 128728 | 
consumed samples: 24048 | consumed tokens: 49250304 | elapsed time per iteration (s): 15.24 | learning rate: 7.880E-06 | global batch size: 16 | lm loss: 6.964286E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1504/ 128728 | consumed samples: 24064 | consumed tokens: 49283072 | elapsed time per iteration (s): 15.26 | learning rate: 7.885E-06 | global batch size: 16 | lm loss: 6.845483E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1505/ 128728 | consumed samples: 24080 | consumed tokens: 49315840 | elapsed time per iteration (s): 15.22 | learning rate: 7.891E-06 | global batch size: 16 | lm loss: 6.827108E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1506/ 128728 | consumed samples: 24096 | consumed tokens: 49348608 | elapsed time per iteration (s): 15.23 | learning rate: 7.896E-06 | global batch size: 16 | lm loss: 6.897807E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1507/ 128728 | consumed samples: 24112 | consumed tokens: 49381376 | elapsed time per iteration (s): 15.19 | learning rate: 7.901E-06 | global batch size: 16 | lm loss: 6.798639E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1508/ 128728 | consumed samples: 24128 | consumed tokens: 49414144 | elapsed time per iteration (s): 15.19 | learning rate: 7.906E-06 | global batch size: 16 | lm loss: 6.913240E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1509/ 128728 | consumed samples: 24144 | consumed tokens: 49446912 | elapsed time per iteration (s): 15.21 | learning rate: 7.912E-06 | global batch size: 16 | lm loss: 6.769604E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1510/ 128728 | consumed samples: 24160 | consumed tokens: 49479680 | elapsed time per iteration (s): 15.27 | learning rate: 7.917E-06 | global batch size: 16 | lm loss: 6.998416E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1511/ 128728 | consumed samples: 24176 | consumed tokens: 49512448 | elapsed time per iteration (s): 15.21 | learning rate: 7.922E-06 | global batch size: 16 | lm loss: 6.917444E+00 | grad norm: 1.343 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1512/ 128728 | consumed samples: 24192 | consumed tokens: 49545216 | elapsed time per iteration (s): 15.21 | learning rate: 7.927E-06 | global batch size: 16 | lm loss: 6.704676E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1513/ 128728 | consumed samples: 24208 | consumed tokens: 49577984 | elapsed time per iteration (s): 15.21 | learning rate: 7.932E-06 | global batch size: 16 | lm loss: 6.625801E+00 | grad norm: 1.004 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1514/ 128728 | consumed samples: 24224 | consumed tokens: 49610752 | elapsed time per iteration (s): 15.22 | learning rate: 7.938E-06 | global batch size: 16 | lm loss: 
6.983078E+00 | grad norm: 1.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1515/ 128728 | consumed samples: 24240 | consumed tokens: 49643520 | elapsed time per iteration (s): 15.16 | learning rate: 7.943E-06 | global batch size: 16 | lm loss: 6.767624E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1516/ 128728 | consumed samples: 24256 | consumed tokens: 49676288 | elapsed time per iteration (s): 15.24 | learning rate: 7.948E-06 | global batch size: 16 | lm loss: 6.977883E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1517/ 128728 | consumed samples: 24272 | consumed tokens: 49709056 | elapsed time per iteration (s): 15.21 | learning rate: 7.953E-06 | global batch size: 16 | lm loss: 6.882602E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1518/ 128728 | consumed samples: 24288 | consumed tokens: 49741824 | elapsed time per iteration (s): 15.27 | learning rate: 7.959E-06 | global batch size: 16 | lm loss: 7.073375E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1519/ 128728 | consumed samples: 24304 | consumed tokens: 49774592 | elapsed time per iteration (s): 15.25 | learning rate: 7.964E-06 | global batch size: 16 | lm loss: 6.735807E+00 | grad norm: 1.008 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1520/ 128728 | consumed samples: 24320 | consumed tokens: 49807360 | 
elapsed time per iteration (s): 15.24 | learning rate: 7.969E-06 | global batch size: 16 | lm loss: 6.980803E+00 | grad norm: 1.043 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1521/ 128728 | consumed samples: 24336 | consumed tokens: 49840128 | elapsed time per iteration (s): 15.26 | learning rate: 7.974E-06 | global batch size: 16 | lm loss: 6.703127E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1522/ 128728 | consumed samples: 24352 | consumed tokens: 49872896 | elapsed time per iteration (s): 15.23 | learning rate: 7.980E-06 | global batch size: 16 | lm loss: 6.806160E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1523/ 128728 | consumed samples: 24368 | consumed tokens: 49905664 | elapsed time per iteration (s): 15.17 | learning rate: 7.985E-06 | global batch size: 16 | lm loss: 6.960247E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1524/ 128728 | consumed samples: 24384 | consumed tokens: 49938432 | elapsed time per iteration (s): 15.19 | learning rate: 7.990E-06 | global batch size: 16 | lm loss: 6.941716E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1525/ 128728 | consumed samples: 24400 | consumed tokens: 49971200 | elapsed time per iteration (s): 15.23 | learning rate: 7.995E-06 | global batch size: 16 | lm loss: 6.999331E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1526/ 128728 | consumed samples: 24416 | consumed tokens: 50003968 | elapsed time per iteration (s): 15.20 | learning rate: 8.001E-06 | global batch size: 16 | lm loss: 6.783548E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1527/ 128728 | consumed samples: 24432 | consumed tokens: 50036736 | elapsed time per iteration (s): 15.24 | learning rate: 8.006E-06 | global batch size: 16 | lm loss: 6.833918E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1528/ 128728 | consumed samples: 24448 | consumed tokens: 50069504 | elapsed time per iteration (s): 15.25 | learning rate: 8.011E-06 | global batch size: 16 | lm loss: 6.808972E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1529/ 128728 | consumed samples: 24464 | consumed tokens: 50102272 | elapsed time per iteration (s): 15.19 | learning rate: 8.016E-06 | global batch size: 16 | lm loss: 6.914598E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1530/ 128728 | consumed samples: 24480 | consumed tokens: 50135040 | elapsed time per iteration (s): 15.25 | learning rate: 8.022E-06 | global batch size: 16 | lm loss: 6.613030E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1531/ 128728 | consumed samples: 24496 | consumed tokens: 50167808 | elapsed time per iteration (s): 15.26 | learning rate: 8.027E-06 | global batch size: 16 | lm loss: 6.960011E+00 | grad norm: 0.959 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1532/ 128728 | consumed samples: 24512 | consumed tokens: 50200576 | elapsed time per iteration (s): 15.22 | learning rate: 8.032E-06 | global batch size: 16 | lm loss: 6.928339E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1533/ 128728 | consumed samples: 24528 | consumed tokens: 50233344 | elapsed time per iteration (s): 15.21 | learning rate: 8.037E-06 | global batch size: 16 | lm loss: 6.872521E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1534/ 128728 | consumed samples: 24544 | consumed tokens: 50266112 | elapsed time per iteration (s): 15.22 | learning rate: 8.043E-06 | global batch size: 16 | lm loss: 7.037945E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1535/ 128728 | consumed samples: 24560 | consumed tokens: 50298880 | elapsed time per iteration (s): 15.22 | learning rate: 8.048E-06 | global batch size: 16 | lm loss: 6.908956E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1536/ 128728 | consumed samples: 24576 | consumed tokens: 50331648 | elapsed time per iteration (s): 15.22 | learning rate: 8.053E-06 | global batch size: 16 | lm loss: 6.830132E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1537/ 128728 | consumed samples: 24592 | consumed tokens: 50364416 | elapsed time per iteration (s): 15.22 | learning rate: 
8.058E-06 | global batch size: 16 | lm loss: 7.005225E+00 | grad norm: 0.901 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1538/ 128728 | consumed samples: 24608 | consumed tokens: 50397184 | elapsed time per iteration (s): 15.24 | learning rate: 8.064E-06 | global batch size: 16 | lm loss: 6.873813E+00 | grad norm: 1.114 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1539/ 128728 | consumed samples: 24624 | consumed tokens: 50429952 | elapsed time per iteration (s): 15.24 | learning rate: 8.069E-06 | global batch size: 16 | lm loss: 7.034050E+00 | grad norm: 1.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1540/ 128728 | consumed samples: 24640 | consumed tokens: 50462720 | elapsed time per iteration (s): 15.20 | learning rate: 8.074E-06 | global batch size: 16 | lm loss: 6.716762E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1541/ 128728 | consumed samples: 24656 | consumed tokens: 50495488 | elapsed time per iteration (s): 15.23 | learning rate: 8.079E-06 | global batch size: 16 | lm loss: 6.718003E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1542/ 128728 | consumed samples: 24672 | consumed tokens: 50528256 | elapsed time per iteration (s): 15.20 | learning rate: 8.085E-06 | global batch size: 16 | lm loss: 6.716470E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1543/ 128728 | consumed 
samples: 24688 | consumed tokens: 50561024 | elapsed time per iteration (s): 15.25 | learning rate: 8.090E-06 | global batch size: 16 | lm loss: 6.932936E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1544/ 128728 | consumed samples: 24704 | consumed tokens: 50593792 | elapsed time per iteration (s): 15.25 | learning rate: 8.095E-06 | global batch size: 16 | lm loss: 6.767054E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1545/ 128728 | consumed samples: 24720 | consumed tokens: 50626560 | elapsed time per iteration (s): 15.24 | learning rate: 8.100E-06 | global batch size: 16 | lm loss: 6.696455E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1546/ 128728 | consumed samples: 24736 | consumed tokens: 50659328 | elapsed time per iteration (s): 15.23 | learning rate: 8.106E-06 | global batch size: 16 | lm loss: 6.987879E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1547/ 128728 | consumed samples: 24752 | consumed tokens: 50692096 | elapsed time per iteration (s): 15.30 | learning rate: 8.111E-06 | global batch size: 16 | lm loss: 6.568095E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 1548/ 128728 | consumed samples: 24768 | consumed tokens: 50724864 | elapsed time per iteration (s): 15.21 | learning rate: 8.116E-06 | global batch size: 16 | lm loss: 6.986506E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1549/ 128728 | consumed samples: 24784 | consumed tokens: 50757632 | elapsed time per iteration (s): 15.24 | learning rate: 8.121E-06 | global batch size: 16 | lm loss: 7.040531E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1550/ 128728 | consumed samples: 24800 | consumed tokens: 50790400 | elapsed time per iteration (s): 15.20 | learning rate: 8.126E-06 | global batch size: 16 | lm loss: 6.698561E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1551/ 128728 | consumed samples: 24816 | consumed tokens: 50823168 | elapsed time per iteration (s): 15.23 | learning rate: 8.132E-06 | global batch size: 16 | lm loss: 6.824203E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1552/ 128728 | consumed samples: 24832 | consumed tokens: 50855936 | elapsed time per iteration (s): 15.28 | learning rate: 8.137E-06 | global batch size: 16 | lm loss: 6.724894E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1553/ 128728 | consumed samples: 24848 | consumed tokens: 50888704 | elapsed time per iteration (s): 15.22 | learning rate: 8.142E-06 | global batch size: 16 | lm loss: 6.692251E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1554/ 128728 | consumed samples: 24864 | consumed tokens: 50921472 | elapsed time per iteration (s): 15.23 | learning rate: 8.147E-06 | global batch size: 16 | lm loss: 
6.816679E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1555/ 128728 | consumed samples: 24880 | consumed tokens: 50954240 | elapsed time per iteration (s): 15.18 | learning rate: 8.153E-06 | global batch size: 16 | lm loss: 6.784638E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1556/ 128728 | consumed samples: 24896 | consumed tokens: 50987008 | elapsed time per iteration (s): 15.25 | learning rate: 8.158E-06 | global batch size: 16 | lm loss: 7.072264E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1557/ 128728 | consumed samples: 24912 | consumed tokens: 51019776 | elapsed time per iteration (s): 15.25 | learning rate: 8.163E-06 | global batch size: 16 | lm loss: 7.026040E+00 | grad norm: 1.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1558/ 128728 | consumed samples: 24928 | consumed tokens: 51052544 | elapsed time per iteration (s): 15.25 | learning rate: 8.168E-06 | global batch size: 16 | lm loss: 6.760884E+00 | grad norm: 1.059 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1559/ 128728 | consumed samples: 24944 | consumed tokens: 51085312 | elapsed time per iteration (s): 15.24 | learning rate: 8.174E-06 | global batch size: 16 | lm loss: 6.945187E+00 | grad norm: 3.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1560/ 128728 | consumed samples: 24960 | consumed tokens: 51118080 | 
elapsed time per iteration (s): 15.24 | learning rate: 8.179E-06 | global batch size: 16 | lm loss: 6.917427E+00 | grad norm: 1.624 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1561/ 128728 | consumed samples: 24976 | consumed tokens: 51150848 | elapsed time per iteration (s): 15.23 | learning rate: 8.184E-06 | global batch size: 16 | lm loss: 6.880846E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1562/ 128728 | consumed samples: 24992 | consumed tokens: 51183616 | elapsed time per iteration (s): 15.25 | learning rate: 8.189E-06 | global batch size: 16 | lm loss: 6.682335E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1563/ 128728 | consumed samples: 25008 | consumed tokens: 51216384 | elapsed time per iteration (s): 15.20 | learning rate: 8.195E-06 | global batch size: 16 | lm loss: 6.699176E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1564/ 128728 | consumed samples: 25024 | consumed tokens: 51249152 | elapsed time per iteration (s): 15.25 | learning rate: 8.200E-06 | global batch size: 16 | lm loss: 7.053262E+00 | grad norm: 1.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1565/ 128728 | consumed samples: 25040 | consumed tokens: 51281920 | elapsed time per iteration (s): 15.21 | learning rate: 8.205E-06 | global batch size: 16 | lm loss: 6.849342E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.05 | [default7]: iteration 1566/ 128728 | consumed samples: 25056 | consumed tokens: 51314688 | elapsed time per iteration (s): 15.23 | learning rate: 8.210E-06 | global batch size: 16 | lm loss: 6.907884E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1567/ 128728 | consumed samples: 25072 | consumed tokens: 51347456 | elapsed time per iteration (s): 15.24 | learning rate: 8.216E-06 | global batch size: 16 | lm loss: 6.791646E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1568/ 128728 | consumed samples: 25088 | consumed tokens: 51380224 | elapsed time per iteration (s): 15.23 | learning rate: 8.221E-06 | global batch size: 16 | lm loss: 6.826554E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1569/ 128728 | consumed samples: 25104 | consumed tokens: 51412992 | elapsed time per iteration (s): 15.17 | learning rate: 8.226E-06 | global batch size: 16 | lm loss: 6.818520E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1570/ 128728 | consumed samples: 25120 | consumed tokens: 51445760 | elapsed time per iteration (s): 15.30 | learning rate: 8.231E-06 | global batch size: 16 | lm loss: 6.837758E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1571/ 128728 | consumed samples: 25136 | consumed tokens: 51478528 | elapsed time per iteration (s): 15.24 | learning rate: 8.237E-06 | global batch size: 16 | lm loss: 6.881770E+00 | grad norm: 0.963 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1572/ 128728 | consumed samples: 25152 | consumed tokens: 51511296 | elapsed time per iteration (s): 15.22 | learning rate: 8.242E-06 | global batch size: 16 | lm loss: 6.722608E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1573/ 128728 | consumed samples: 25168 | consumed tokens: 51544064 | elapsed time per iteration (s): 15.24 | learning rate: 8.247E-06 | global batch size: 16 | lm loss: 6.414332E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1574/ 128728 | consumed samples: 25184 | consumed tokens: 51576832 | elapsed time per iteration (s): 15.19 | learning rate: 8.252E-06 | global batch size: 16 | lm loss: 6.807733E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1575/ 128728 | consumed samples: 25200 | consumed tokens: 51609600 | elapsed time per iteration (s): 15.24 | learning rate: 8.258E-06 | global batch size: 16 | lm loss: 6.740201E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1576/ 128728 | consumed samples: 25216 | consumed tokens: 51642368 | elapsed time per iteration (s): 15.26 | learning rate: 8.263E-06 | global batch size: 16 | lm loss: 7.032575E+00 | grad norm: 1.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1577/ 128728 | consumed samples: 25232 | consumed tokens: 51675136 | elapsed time per iteration (s): 15.27 | learning rate: 
8.268E-06 | global batch size: 16 | lm loss: 6.839057E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1578/ 128728 | consumed samples: 25248 | consumed tokens: 51707904 | elapsed time per iteration (s): 15.24 | learning rate: 8.273E-06 | global batch size: 16 | lm loss: 7.019974E+00 | grad norm: 3.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1579/ 128728 | consumed samples: 25264 | consumed tokens: 51740672 | elapsed time per iteration (s): 15.26 | learning rate: 8.279E-06 | global batch size: 16 | lm loss: 6.777742E+00 | grad norm: 1.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1580/ 128728 | consumed samples: 25280 | consumed tokens: 51773440 | elapsed time per iteration (s): 15.22 | learning rate: 8.284E-06 | global batch size: 16 | lm loss: 6.943933E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1581/ 128728 | consumed samples: 25296 | consumed tokens: 51806208 | elapsed time per iteration (s): 15.25 | learning rate: 8.289E-06 | global batch size: 16 | lm loss: 6.761483E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1582/ 128728 | consumed samples: 25312 | consumed tokens: 51838976 | elapsed time per iteration (s): 15.19 | learning rate: 8.294E-06 | global batch size: 16 | lm loss: 6.671811E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1583/ 128728 | consumed 
samples: 25328 | consumed tokens: 51871744 | elapsed time per iteration (s): 15.22 | learning rate: 8.300E-06 | global batch size: 16 | lm loss: 6.679467E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1584/ 128728 | consumed samples: 25344 | consumed tokens: 51904512 | elapsed time per iteration (s): 15.23 | learning rate: 8.305E-06 | global batch size: 16 | lm loss: 6.834284E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1585/ 128728 | consumed samples: 25360 | consumed tokens: 51937280 | elapsed time per iteration (s): 15.22 | learning rate: 8.310E-06 | global batch size: 16 | lm loss: 6.797321E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1586/ 128728 | consumed samples: 25376 | consumed tokens: 51970048 | elapsed time per iteration (s): 15.17 | learning rate: 8.315E-06 | global batch size: 16 | lm loss: 6.856018E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1587/ 128728 | consumed samples: 25392 | consumed tokens: 52002816 | elapsed time per iteration (s): 15.17 | learning rate: 8.320E-06 | global batch size: 16 | lm loss: 6.861340E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 1588/ 128728 | consumed samples: 25408 | consumed tokens: 52035584 | elapsed time per iteration (s): 15.22 | learning rate: 8.326E-06 | global batch size: 16 | lm loss: 6.824626E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1589/ 128728 | consumed samples: 25424 | consumed tokens: 52068352 | elapsed time per iteration (s): 15.15 | learning rate: 8.331E-06 | global batch size: 16 | lm loss: 6.807738E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 1590/ 128728 | consumed samples: 25440 | consumed tokens: 52101120 | elapsed time per iteration (s): 15.21 | learning rate: 8.336E-06 | global batch size: 16 | lm loss: 6.876394E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1591/ 128728 | consumed samples: 25456 | consumed tokens: 52133888 | elapsed time per iteration (s): 15.23 | learning rate: 8.341E-06 | global batch size: 16 | lm loss: 6.738904E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1592/ 128728 | consumed samples: 25472 | consumed tokens: 52166656 | elapsed time per iteration (s): 15.21 | learning rate: 8.347E-06 | global batch size: 16 | lm loss: 6.556417E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1593/ 128728 | consumed samples: 25488 | consumed tokens: 52199424 | elapsed time per iteration (s): 15.17 | learning rate: 8.352E-06 | global batch size: 16 | lm loss: 6.714180E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1594/ 128728 | consumed samples: 25504 | consumed tokens: 52232192 | elapsed time per iteration (s): 15.18 | learning rate: 8.357E-06 | global batch size: 16 | lm loss: 
6.833164E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1595/ 128728 | consumed samples: 25520 | consumed tokens: 52264960 | elapsed time per iteration (s): 15.27 | learning rate: 8.362E-06 | global batch size: 16 | lm loss: 6.748915E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1596/ 128728 | consumed samples: 25536 | consumed tokens: 52297728 | elapsed time per iteration (s): 15.21 | learning rate: 8.368E-06 | global batch size: 16 | lm loss: 6.567333E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1597/ 128728 | consumed samples: 25552 | consumed tokens: 52330496 | elapsed time per iteration (s): 15.18 | learning rate: 8.373E-06 | global batch size: 16 | lm loss: 6.716132E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1598/ 128728 | consumed samples: 25568 | consumed tokens: 52363264 | elapsed time per iteration (s): 15.18 | learning rate: 8.378E-06 | global batch size: 16 | lm loss: 7.036856E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1599/ 128728 | consumed samples: 25584 | consumed tokens: 52396032 | elapsed time per iteration (s): 15.26 | learning rate: 8.383E-06 | global batch size: 16 | lm loss: 6.838940E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1600/ 128728 | consumed samples: 25600 | consumed tokens: 52428800 | 
elapsed time per iteration (s): 15.22 | learning rate: 8.389E-06 | global batch size: 16 | lm loss: 6.934296E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1601/ 128728 | consumed samples: 25616 | consumed tokens: 52461568 | elapsed time per iteration (s): 15.20 | learning rate: 8.394E-06 | global batch size: 16 | lm loss: 7.009863E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1602/ 128728 | consumed samples: 25632 | consumed tokens: 52494336 | elapsed time per iteration (s): 15.22 | learning rate: 8.399E-06 | global batch size: 16 | lm loss: 6.868528E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1603/ 128728 | consumed samples: 25648 | consumed tokens: 52527104 | elapsed time per iteration (s): 15.26 | learning rate: 8.404E-06 | global batch size: 16 | lm loss: 6.692941E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1604/ 128728 | consumed samples: 25664 | consumed tokens: 52559872 | elapsed time per iteration (s): 15.25 | learning rate: 8.410E-06 | global batch size: 16 | lm loss: 6.775326E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1605/ 128728 | consumed samples: 25680 | consumed tokens: 52592640 | elapsed time per iteration (s): 15.23 | learning rate: 8.415E-06 | global batch size: 16 | lm loss: 6.836594E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 1606/ 128728 | consumed samples: 25696 | consumed tokens: 52625408 | elapsed time per iteration (s): 15.19 | learning rate: 8.420E-06 | global batch size: 16 | lm loss: 6.700777E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1607/ 128728 | consumed samples: 25712 | consumed tokens: 52658176 | elapsed time per iteration (s): 15.22 | learning rate: 8.425E-06 | global batch size: 16 | lm loss: 6.842509E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1608/ 128728 | consumed samples: 25728 | consumed tokens: 52690944 | elapsed time per iteration (s): 15.22 | learning rate: 8.431E-06 | global batch size: 16 | lm loss: 6.609758E+00 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1609/ 128728 | consumed samples: 25744 | consumed tokens: 52723712 | elapsed time per iteration (s): 15.24 | learning rate: 8.436E-06 | global batch size: 16 | lm loss: 6.705388E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1610/ 128728 | consumed samples: 25760 | consumed tokens: 52756480 | elapsed time per iteration (s): 15.22 | learning rate: 8.441E-06 | global batch size: 16 | lm loss: 7.225027E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1611/ 128728 | consumed samples: 25776 | consumed tokens: 52789248 | elapsed time per iteration (s): 15.23 | learning rate: 8.446E-06 | global batch size: 16 | lm loss: 6.473947E+00 | grad norm: 0.784 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1612/ 128728 | consumed samples: 25792 | consumed tokens: 52822016 | elapsed time per iteration (s): 15.22 | learning rate: 8.452E-06 | global batch size: 16 | lm loss: 6.753922E+00 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1613/ 128728 | consumed samples: 25808 | consumed tokens: 52854784 | elapsed time per iteration (s): 15.22 | learning rate: 8.457E-06 | global batch size: 16 | lm loss: 6.546061E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1614/ 128728 | consumed samples: 25824 | consumed tokens: 52887552 | elapsed time per iteration (s): 15.21 | learning rate: 8.462E-06 | global batch size: 16 | lm loss: 6.621816E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1615/ 128728 | consumed samples: 25840 | consumed tokens: 52920320 | elapsed time per iteration (s): 15.24 | learning rate: 8.467E-06 | global batch size: 16 | lm loss: 6.808933E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1616/ 128728 | consumed samples: 25856 | consumed tokens: 52953088 | elapsed time per iteration (s): 15.24 | learning rate: 8.473E-06 | global batch size: 16 | lm loss: 6.900961E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1617/ 128728 | consumed samples: 25872 | consumed tokens: 52985856 | elapsed time per iteration (s): 15.26 | learning rate: 
8.478E-06 | global batch size: 16 | lm loss: 6.817991E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1618/ 128728 | consumed samples: 25888 | consumed tokens: 53018624 | elapsed time per iteration (s): 15.24 | learning rate: 8.483E-06 | global batch size: 16 | lm loss: 6.795853E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1619/ 128728 | consumed samples: 25904 | consumed tokens: 53051392 | elapsed time per iteration (s): 15.30 | learning rate: 8.488E-06 | global batch size: 16 | lm loss: 6.808154E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1620/ 128728 | consumed samples: 25920 | consumed tokens: 53084160 | elapsed time per iteration (s): 15.22 | learning rate: 8.493E-06 | global batch size: 16 | lm loss: 6.756622E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1621/ 128728 | consumed samples: 25936 | consumed tokens: 53116928 | elapsed time per iteration (s): 15.24 | learning rate: 8.499E-06 | global batch size: 16 | lm loss: 6.731472E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1622/ 128728 | consumed samples: 25952 | consumed tokens: 53149696 | elapsed time per iteration (s): 15.25 | learning rate: 8.504E-06 | global batch size: 16 | lm loss: 6.762363E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1623/ 128728 | consumed 
samples: 25968 | consumed tokens: 53182464 | elapsed time per iteration (s): 15.22 | learning rate: 8.509E-06 | global batch size: 16 | lm loss: 6.635185E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1624/ 128728 | consumed samples: 25984 | consumed tokens: 53215232 | elapsed time per iteration (s): 15.24 | learning rate: 8.514E-06 | global batch size: 16 | lm loss: 6.719479E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1625/ 128728 | consumed samples: 26000 | consumed tokens: 53248000 | elapsed time per iteration (s): 15.20 | learning rate: 8.520E-06 | global batch size: 16 | lm loss: 6.719177E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1626/ 128728 | consumed samples: 26016 | consumed tokens: 53280768 | elapsed time per iteration (s): 15.23 | learning rate: 8.525E-06 | global batch size: 16 | lm loss: 6.717042E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1627/ 128728 | consumed samples: 26032 | consumed tokens: 53313536 | elapsed time per iteration (s): 15.20 | learning rate: 8.530E-06 | global batch size: 16 | lm loss: 6.696861E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1628/ 128728 | consumed samples: 26048 | consumed tokens: 53346304 | elapsed time per iteration (s): 15.23 | learning rate: 8.535E-06 | global batch size: 16 | lm loss: 6.720668E+00 | grad norm: 0.949 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1629/ 128728 | consumed samples: 26064 | consumed tokens: 53379072 | elapsed time per iteration (s): 15.21 | learning rate: 8.541E-06 | global batch size: 16 | lm loss: 6.852234E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1630/ 128728 | consumed samples: 26080 | consumed tokens: 53411840 | elapsed time per iteration (s): 15.23 | learning rate: 8.546E-06 | global batch size: 16 | lm loss: 6.824490E+00 | grad norm: 1.112 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1631/ 128728 | consumed samples: 26096 | consumed tokens: 53444608 | elapsed time per iteration (s): 15.23 | learning rate: 8.551E-06 | global batch size: 16 | lm loss: 6.849283E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1632/ 128728 | consumed samples: 26112 | consumed tokens: 53477376 | elapsed time per iteration (s): 15.22 | learning rate: 8.556E-06 | global batch size: 16 | lm loss: 6.797266E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1633/ 128728 | consumed samples: 26128 | consumed tokens: 53510144 | elapsed time per iteration (s): 15.24 | learning rate: 8.562E-06 | global batch size: 16 | lm loss: 6.806245E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1634/ 128728 | consumed samples: 26144 | consumed tokens: 53542912 | elapsed time per iteration (s): 15.23 | learning rate: 8.567E-06 | global batch size: 16 | lm loss: 
6.727156E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1635/ 128728 | consumed samples: 26160 | consumed tokens: 53575680 | elapsed time per iteration (s): 15.22 | learning rate: 8.572E-06 | global batch size: 16 | lm loss: 6.681533E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1636/ 128728 | consumed samples: 26176 | consumed tokens: 53608448 | elapsed time per iteration (s): 15.24 | learning rate: 8.577E-06 | global batch size: 16 | lm loss: 6.698471E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1637/ 128728 | consumed samples: 26192 | consumed tokens: 53641216 | elapsed time per iteration (s): 15.22 | learning rate: 8.583E-06 | global batch size: 16 | lm loss: 6.622070E+00 | grad norm: 0.901 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1638/ 128728 | consumed samples: 26208 | consumed tokens: 53673984 | elapsed time per iteration (s): 15.21 | learning rate: 8.588E-06 | global batch size: 16 | lm loss: 6.974732E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1639/ 128728 | consumed samples: 26224 | consumed tokens: 53706752 | elapsed time per iteration (s): 15.23 | learning rate: 8.593E-06 | global batch size: 16 | lm loss: 6.687452E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1640/ 128728 | consumed samples: 26240 | consumed tokens: 53739520 | 
elapsed time per iteration (s): 15.24 | learning rate: 8.598E-06 | global batch size: 16 | lm loss: 6.786465E+00 | grad norm: 1.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1641/ 128728 | consumed samples: 26256 | consumed tokens: 53772288 | elapsed time per iteration (s): 15.21 | learning rate: 8.604E-06 | global batch size: 16 | lm loss: 6.489959E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1642/ 128728 | consumed samples: 26272 | consumed tokens: 53805056 | elapsed time per iteration (s): 15.22 | learning rate: 8.609E-06 | global batch size: 16 | lm loss: 6.655648E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1643/ 128728 | consumed samples: 26288 | consumed tokens: 53837824 | elapsed time per iteration (s): 15.23 | learning rate: 8.614E-06 | global batch size: 16 | lm loss: 6.969821E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1644/ 128728 | consumed samples: 26304 | consumed tokens: 53870592 | elapsed time per iteration (s): 15.20 | learning rate: 8.619E-06 | global batch size: 16 | lm loss: 6.872509E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1645/ 128728 | consumed samples: 26320 | consumed tokens: 53903360 | elapsed time per iteration (s): 15.19 | learning rate: 8.625E-06 | global batch size: 16 | lm loss: 6.742445E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 
8.06 | [default7]: iteration 1646/ 128728 | consumed samples: 26336 | consumed tokens: 53936128 | elapsed time per iteration (s): 15.22 | learning rate: 8.630E-06 | global batch size: 16 | lm loss: 6.764326E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1647/ 128728 | consumed samples: 26352 | consumed tokens: 53968896 | elapsed time per iteration (s): 15.22 | learning rate: 8.635E-06 | global batch size: 16 | lm loss: 6.716838E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1648/ 128728 | consumed samples: 26368 | consumed tokens: 54001664 | elapsed time per iteration (s): 15.24 | learning rate: 8.640E-06 | global batch size: 16 | lm loss: 6.780625E+00 | grad norm: 1.095 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1649/ 128728 | consumed samples: 26384 | consumed tokens: 54034432 | elapsed time per iteration (s): 15.22 | learning rate: 8.646E-06 | global batch size: 16 | lm loss: 6.868196E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1650/ 128728 | consumed samples: 26400 | consumed tokens: 54067200 | elapsed time per iteration (s): 15.23 | learning rate: 8.651E-06 | global batch size: 16 | lm loss: 6.551585E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1651/ 128728 | consumed samples: 26416 | consumed tokens: 54099968 | elapsed time per iteration (s): 15.21 | learning rate: 8.656E-06 | global batch size: 16 | lm loss: 6.747394E+00 | grad norm: 0.803 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1652/ 128728 | consumed samples: 26432 | consumed tokens: 54132736 | elapsed time per iteration (s): 15.24 | learning rate: 8.661E-06 | global batch size: 16 | lm loss: 6.589448E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1653/ 128728 | consumed samples: 26448 | consumed tokens: 54165504 | elapsed time per iteration (s): 15.23 | learning rate: 8.667E-06 | global batch size: 16 | lm loss: 6.797907E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1654/ 128728 | consumed samples: 26464 | consumed tokens: 54198272 | elapsed time per iteration (s): 15.23 | learning rate: 8.672E-06 | global batch size: 16 | lm loss: 6.482568E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1655/ 128728 | consumed samples: 26480 | consumed tokens: 54231040 | elapsed time per iteration (s): 15.23 | learning rate: 8.677E-06 | global batch size: 16 | lm loss: 6.558523E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1656/ 128728 | consumed samples: 26496 | consumed tokens: 54263808 | elapsed time per iteration (s): 15.22 | learning rate: 8.682E-06 | global batch size: 16 | lm loss: 6.789701E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1657/ 128728 | consumed samples: 26512 | consumed tokens: 54296576 | elapsed time per iteration (s): 15.21 | learning rate: 
8.687E-06 | global batch size: 16 | lm loss: 6.682187E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1658/ 128728 | consumed samples: 26528 | consumed tokens: 54329344 | elapsed time per iteration (s): 15.25 | learning rate: 8.693E-06 | global batch size: 16 | lm loss: 6.815564E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1659/ 128728 | consumed samples: 26544 | consumed tokens: 54362112 | elapsed time per iteration (s): 15.22 | learning rate: 8.698E-06 | global batch size: 16 | lm loss: 6.569890E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1660/ 128728 | consumed samples: 26560 | consumed tokens: 54394880 | elapsed time per iteration (s): 15.22 | learning rate: 8.703E-06 | global batch size: 16 | lm loss: 6.928395E+00 | grad norm: 1.473 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1661/ 128728 | consumed samples: 26576 | consumed tokens: 54427648 | elapsed time per iteration (s): 15.24 | learning rate: 8.708E-06 | global batch size: 16 | lm loss: 6.680755E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1662/ 128728 | consumed samples: 26592 | consumed tokens: 54460416 | elapsed time per iteration (s): 15.25 | learning rate: 8.714E-06 | global batch size: 16 | lm loss: 6.746358E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1663/ 128728 | consumed 
samples: 26608 | consumed tokens: 54493184 | elapsed time per iteration (s): 15.20 | learning rate: 8.719E-06 | global batch size: 16 | lm loss: 6.800119E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1664/ 128728 | consumed samples: 26624 | consumed tokens: 54525952 | elapsed time per iteration (s): 15.21 | learning rate: 8.724E-06 | global batch size: 16 | lm loss: 6.989696E+00 | grad norm: 0.926 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1665/ 128728 | consumed samples: 26640 | consumed tokens: 54558720 | elapsed time per iteration (s): 15.22 | learning rate: 8.729E-06 | global batch size: 16 | lm loss: 6.610164E+00 | grad norm: 0.990 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1666/ 128728 | consumed samples: 26656 | consumed tokens: 54591488 | elapsed time per iteration (s): 15.24 | learning rate: 8.735E-06 | global batch size: 16 | lm loss: 6.775718E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1667/ 128728 | consumed samples: 26672 | consumed tokens: 54624256 | elapsed time per iteration (s): 15.21 | learning rate: 8.740E-06 | global batch size: 16 | lm loss: 6.772297E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1668/ 128728 | consumed samples: 26688 | consumed tokens: 54657024 | elapsed time per iteration (s): 15.21 | learning rate: 8.745E-06 | global batch size: 16 | lm loss: 6.742491E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1669/ 128728 | consumed samples: 26704 | consumed tokens: 54689792 | elapsed time per iteration (s): 15.21 | learning rate: 8.750E-06 | global batch size: 16 | lm loss: 6.486816E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1670/ 128728 | consumed samples: 26720 | consumed tokens: 54722560 | elapsed time per iteration (s): 15.15 | learning rate: 8.756E-06 | global batch size: 16 | lm loss: 6.712128E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 1671/ 128728 | consumed samples: 26736 | consumed tokens: 54755328 | elapsed time per iteration (s): 15.20 | learning rate: 8.761E-06 | global batch size: 16 | lm loss: 6.731301E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1672/ 128728 | consumed samples: 26752 | consumed tokens: 54788096 | elapsed time per iteration (s): 15.22 | learning rate: 8.766E-06 | global batch size: 16 | lm loss: 6.672599E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1673/ 128728 | consumed samples: 26768 | consumed tokens: 54820864 | elapsed time per iteration (s): 15.25 | learning rate: 8.771E-06 | global batch size: 16 | lm loss: 6.849351E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1674/ 128728 | consumed samples: 26784 | consumed tokens: 54853632 | elapsed time per iteration (s): 15.23 | learning rate: 8.777E-06 | global batch size: 16 | lm loss: 
6.601808E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1675/ 128728 | consumed samples: 26800 | consumed tokens: 54886400 | elapsed time per iteration (s): 15.18 | learning rate: 8.782E-06 | global batch size: 16 | lm loss: 6.788216E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1676/ 128728 | consumed samples: 26816 | consumed tokens: 54919168 | elapsed time per iteration (s): 15.23 | learning rate: 8.787E-06 | global batch size: 16 | lm loss: 6.842864E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1677/ 128728 | consumed samples: 26832 | consumed tokens: 54951936 | elapsed time per iteration (s): 15.24 | learning rate: 8.792E-06 | global batch size: 16 | lm loss: 6.575851E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1678/ 128728 | consumed samples: 26848 | consumed tokens: 54984704 | elapsed time per iteration (s): 15.25 | learning rate: 8.798E-06 | global batch size: 16 | lm loss: 6.952417E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1679/ 128728 | consumed samples: 26864 | consumed tokens: 55017472 | elapsed time per iteration (s): 15.24 | learning rate: 8.803E-06 | global batch size: 16 | lm loss: 6.949244E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1680/ 128728 | consumed samples: 26880 | consumed tokens: 55050240 | 
elapsed time per iteration (s): 15.18 | learning rate: 8.808E-06 | global batch size: 16 | lm loss: 6.862979E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1681/ 128728 | consumed samples: 26896 | consumed tokens: 55083008 | elapsed time per iteration (s): 15.22 | learning rate: 8.813E-06 | global batch size: 16 | lm loss: 6.461435E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1682/ 128728 | consumed samples: 26912 | consumed tokens: 55115776 | elapsed time per iteration (s): 15.21 | learning rate: 8.819E-06 | global batch size: 16 | lm loss: 6.786581E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1683/ 128728 | consumed samples: 26928 | consumed tokens: 55148544 | elapsed time per iteration (s): 15.27 | learning rate: 8.824E-06 | global batch size: 16 | lm loss: 6.698905E+00 | grad norm: 1.028 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1684/ 128728 | consumed samples: 26944 | consumed tokens: 55181312 | elapsed time per iteration (s): 15.25 | learning rate: 8.829E-06 | global batch size: 16 | lm loss: 6.586331E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1685/ 128728 | consumed samples: 26960 | consumed tokens: 55214080 | elapsed time per iteration (s): 15.14 | learning rate: 8.834E-06 | global batch size: 16 | lm loss: 6.709742E+00 | grad norm: 0.892 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 
8.09 | [default7]: iteration 1686/ 128728 | consumed samples: 26976 | consumed tokens: 55246848 | elapsed time per iteration (s): 15.17 | learning rate: 8.840E-06 | global batch size: 16 | lm loss: 6.868404E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1687/ 128728 | consumed samples: 26992 | consumed tokens: 55279616 | elapsed time per iteration (s): 15.28 | learning rate: 8.845E-06 | global batch size: 16 | lm loss: 6.532902E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1688/ 128728 | consumed samples: 27008 | consumed tokens: 55312384 | elapsed time per iteration (s): 15.24 | learning rate: 8.850E-06 | global batch size: 16 | lm loss: 6.821950E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1689/ 128728 | consumed samples: 27024 | consumed tokens: 55345152 | elapsed time per iteration (s): 15.24 | learning rate: 8.855E-06 | global batch size: 16 | lm loss: 6.559717E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1690/ 128728 | consumed samples: 27040 | consumed tokens: 55377920 | elapsed time per iteration (s): 15.23 | learning rate: 8.860E-06 | global batch size: 16 | lm loss: 6.734614E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1691/ 128728 | consumed samples: 27056 | consumed tokens: 55410688 | elapsed time per iteration (s): 15.23 | learning rate: 8.866E-06 | global batch size: 16 | lm loss: 6.887871E+00 | grad norm: 0.843 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1692/ 128728 | consumed samples: 27072 | consumed tokens: 55443456 | elapsed time per iteration (s): 15.23 | learning rate: 8.871E-06 | global batch size: 16 | lm loss: 6.599645E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1693/ 128728 | consumed samples: 27088 | consumed tokens: 55476224 | elapsed time per iteration (s): 15.21 | learning rate: 8.876E-06 | global batch size: 16 | lm loss: 6.764441E+00 | grad norm: 2.068 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1694/ 128728 | consumed samples: 27104 | consumed tokens: 55508992 | elapsed time per iteration (s): 15.25 | learning rate: 8.881E-06 | global batch size: 16 | lm loss: 6.749056E+00 | grad norm: 0.931 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1695/ 128728 | consumed samples: 27120 | consumed tokens: 55541760 | elapsed time per iteration (s): 15.23 | learning rate: 8.887E-06 | global batch size: 16 | lm loss: 6.714129E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1696/ 128728 | consumed samples: 27136 | consumed tokens: 55574528 | elapsed time per iteration (s): 15.24 | learning rate: 8.892E-06 | global batch size: 16 | lm loss: 6.672210E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1697/ 128728 | consumed samples: 27152 | consumed tokens: 55607296 | elapsed time per iteration (s): 15.26 | learning rate: 
8.897E-06 | global batch size: 16 | lm loss: 6.633732E+00 | grad norm: 1.072 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1698/ 128728 | consumed samples: 27168 | consumed tokens: 55640064 | elapsed time per iteration (s): 15.27 | learning rate: 8.902E-06 | global batch size: 16 | lm loss: 6.510969E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1699/ 128728 | consumed samples: 27184 | consumed tokens: 55672832 | elapsed time per iteration (s): 15.26 | learning rate: 8.908E-06 | global batch size: 16 | lm loss: 6.668943E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1700/ 128728 | consumed samples: 27200 | consumed tokens: 55705600 | elapsed time per iteration (s): 15.21 | learning rate: 8.913E-06 | global batch size: 16 | lm loss: 6.773491E+00 | grad norm: 0.973 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1701/ 128728 | consumed samples: 27216 | consumed tokens: 55738368 | elapsed time per iteration (s): 15.27 | learning rate: 8.918E-06 | global batch size: 16 | lm loss: 6.664038E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1702/ 128728 | consumed samples: 27232 | consumed tokens: 55771136 | elapsed time per iteration (s): 15.27 | learning rate: 8.923E-06 | global batch size: 16 | lm loss: 6.511447E+00 | grad norm: 1.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1703/ 128728 | consumed 
samples: 27248 | consumed tokens: 55803904 | elapsed time per iteration (s): 15.18 | learning rate: 8.929E-06 | global batch size: 16 | lm loss: 6.659809E+00 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1704/ 128728 | consumed samples: 27264 | consumed tokens: 55836672 | elapsed time per iteration (s): 15.24 | learning rate: 8.934E-06 | global batch size: 16 | lm loss: 6.674672E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1705/ 128728 | consumed samples: 27280 | consumed tokens: 55869440 | elapsed time per iteration (s): 15.25 | learning rate: 8.939E-06 | global batch size: 16 | lm loss: 6.860772E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1706/ 128728 | consumed samples: 27296 | consumed tokens: 55902208 | elapsed time per iteration (s): 15.24 | learning rate: 8.944E-06 | global batch size: 16 | lm loss: 6.839284E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1707/ 128728 | consumed samples: 27312 | consumed tokens: 55934976 | elapsed time per iteration (s): 15.24 | learning rate: 8.950E-06 | global batch size: 16 | lm loss: 6.650226E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1708/ 128728 | consumed samples: 27328 | consumed tokens: 55967744 | elapsed time per iteration (s): 15.25 | learning rate: 8.955E-06 | global batch size: 16 | lm loss: 6.606098E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1709/ 128728 | consumed samples: 27344 | consumed tokens: 56000512 | elapsed time per iteration (s): 15.23 | learning rate: 8.960E-06 | global batch size: 16 | lm loss: 6.536633E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1710/ 128728 | consumed samples: 27360 | consumed tokens: 56033280 | elapsed time per iteration (s): 15.23 | learning rate: 8.965E-06 | global batch size: 16 | lm loss: 6.541372E+00 | grad norm: 1.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1711/ 128728 | consumed samples: 27376 | consumed tokens: 56066048 | elapsed time per iteration (s): 15.22 | learning rate: 8.971E-06 | global batch size: 16 | lm loss: 6.686945E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1712/ 128728 | consumed samples: 27392 | consumed tokens: 56098816 | elapsed time per iteration (s): 15.27 | learning rate: 8.976E-06 | global batch size: 16 | lm loss: 6.757609E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1713/ 128728 | consumed samples: 27408 | consumed tokens: 56131584 | elapsed time per iteration (s): 15.25 | learning rate: 8.981E-06 | global batch size: 16 | lm loss: 6.817521E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1714/ 128728 | consumed samples: 27424 | consumed tokens: 56164352 | elapsed time per iteration (s): 15.24 | learning rate: 8.986E-06 | global batch size: 16 | lm loss: 
6.714323E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1715/ 128728 | consumed samples: 27440 | consumed tokens: 56197120 | elapsed time per iteration (s): 15.25 | learning rate: 8.992E-06 | global batch size: 16 | lm loss: 6.906718E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1716/ 128728 | consumed samples: 27456 | consumed tokens: 56229888 | elapsed time per iteration (s): 15.26 | learning rate: 8.997E-06 | global batch size: 16 | lm loss: 6.650734E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1717/ 128728 | consumed samples: 27472 | consumed tokens: 56262656 | elapsed time per iteration (s): 15.23 | learning rate: 9.002E-06 | global batch size: 16 | lm loss: 6.751576E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1718/ 128728 | consumed samples: 27488 | consumed tokens: 56295424 | elapsed time per iteration (s): 15.24 | learning rate: 9.007E-06 | global batch size: 16 | lm loss: 6.557451E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1719/ 128728 | consumed samples: 27504 | consumed tokens: 56328192 | elapsed time per iteration (s): 15.23 | learning rate: 9.013E-06 | global batch size: 16 | lm loss: 6.755507E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1720/ 128728 | consumed samples: 27520 | consumed tokens: 56360960 | 
elapsed time per iteration (s): 15.24 | learning rate: 9.018E-06 | global batch size: 16 | lm loss: 6.720637E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1721/ 128728 | consumed samples: 27536 | consumed tokens: 56393728 | elapsed time per iteration (s): 15.21 | learning rate: 9.023E-06 | global batch size: 16 | lm loss: 6.478833E+00 | grad norm: 1.104 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1722/ 128728 | consumed samples: 27552 | consumed tokens: 56426496 | elapsed time per iteration (s): 15.23 | learning rate: 9.028E-06 | global batch size: 16 | lm loss: 6.794544E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1723/ 128728 | consumed samples: 27568 | consumed tokens: 56459264 | elapsed time per iteration (s): 15.25 | learning rate: 9.034E-06 | global batch size: 16 | lm loss: 6.539186E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1724/ 128728 | consumed samples: 27584 | consumed tokens: 56492032 | elapsed time per iteration (s): 15.21 | learning rate: 9.039E-06 | global batch size: 16 | lm loss: 6.716591E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1725/ 128728 | consumed samples: 27600 | consumed tokens: 56524800 | elapsed time per iteration (s): 15.26 | learning rate: 9.044E-06 | global batch size: 16 | lm loss: 6.714130E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 
8.03 | [default7]: iteration 1726/ 128728 | consumed samples: 27616 | consumed tokens: 56557568 | elapsed time per iteration (s): 15.24 | learning rate: 9.049E-06 | global batch size: 16 | lm loss: 6.706204E+00 | grad norm: 1.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1727/ 128728 | consumed samples: 27632 | consumed tokens: 56590336 | elapsed time per iteration (s): 15.23 | learning rate: 9.054E-06 | global batch size: 16 | lm loss: 6.605562E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1728/ 128728 | consumed samples: 27648 | consumed tokens: 56623104 | elapsed time per iteration (s): 15.24 | learning rate: 9.060E-06 | global batch size: 16 | lm loss: 6.806660E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1729/ 128728 | consumed samples: 27664 | consumed tokens: 56655872 | elapsed time per iteration (s): 15.23 | learning rate: 9.065E-06 | global batch size: 16 | lm loss: 6.988270E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1730/ 128728 | consumed samples: 27680 | consumed tokens: 56688640 | elapsed time per iteration (s): 15.23 | learning rate: 9.070E-06 | global batch size: 16 | lm loss: 6.551892E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1731/ 128728 | consumed samples: 27696 | consumed tokens: 56721408 | elapsed time per iteration (s): 15.22 | learning rate: 9.075E-06 | global batch size: 16 | lm loss: 6.359119E+00 | grad norm: 0.821 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1732/ 128728 | consumed samples: 27712 | consumed tokens: 56754176 | elapsed time per iteration (s): 15.23 | learning rate: 9.081E-06 | global batch size: 16 | lm loss: 6.745327E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1733/ 128728 | consumed samples: 27728 | consumed tokens: 56786944 | elapsed time per iteration (s): 15.18 | learning rate: 9.086E-06 | global batch size: 16 | lm loss: 6.495726E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1734/ 128728 | consumed samples: 27744 | consumed tokens: 56819712 | elapsed time per iteration (s): 15.23 | learning rate: 9.091E-06 | global batch size: 16 | lm loss: 6.595272E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1735/ 128728 | consumed samples: 27760 | consumed tokens: 56852480 | elapsed time per iteration (s): 15.24 | learning rate: 9.096E-06 | global batch size: 16 | lm loss: 6.750875E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1736/ 128728 | consumed samples: 27776 | consumed tokens: 56885248 | elapsed time per iteration (s): 15.25 | learning rate: 9.102E-06 | global batch size: 16 | lm loss: 6.515401E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1737/ 128728 | consumed samples: 27792 | consumed tokens: 56918016 | elapsed time per iteration (s): 15.24 | learning rate: 
9.107E-06 | global batch size: 16 | lm loss: 6.513342E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1738/ 128728 | consumed samples: 27808 | consumed tokens: 56950784 | elapsed time per iteration (s): 15.23 | learning rate: 9.112E-06 | global batch size: 16 | lm loss: 6.627918E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1739/ 128728 | consumed samples: 27824 | consumed tokens: 56983552 | elapsed time per iteration (s): 15.24 | learning rate: 9.117E-06 | global batch size: 16 | lm loss: 6.685300E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1740/ 128728 | consumed samples: 27840 | consumed tokens: 57016320 | elapsed time per iteration (s): 15.26 | learning rate: 9.123E-06 | global batch size: 16 | lm loss: 6.637107E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1741/ 128728 | consumed samples: 27856 | consumed tokens: 57049088 | elapsed time per iteration (s): 15.21 | learning rate: 9.128E-06 | global batch size: 16 | lm loss: 6.694432E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1742/ 128728 | consumed samples: 27872 | consumed tokens: 57081856 | elapsed time per iteration (s): 15.29 | learning rate: 9.133E-06 | global batch size: 16 | lm loss: 6.972545E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 1743/ 128728 | consumed 
samples: 27888 | consumed tokens: 57114624 | elapsed time per iteration (s): 15.23 | learning rate: 9.138E-06 | global batch size: 16 | lm loss: 6.513799E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1744/ 128728 | consumed samples: 27904 | consumed tokens: 57147392 | elapsed time per iteration (s): 15.21 | learning rate: 9.144E-06 | global batch size: 16 | lm loss: 6.752306E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1745/ 128728 | consumed samples: 27920 | consumed tokens: 57180160 | elapsed time per iteration (s): 15.20 | learning rate: 9.149E-06 | global batch size: 16 | lm loss: 6.714429E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1746/ 128728 | consumed samples: 27936 | consumed tokens: 57212928 | elapsed time per iteration (s): 15.24 | learning rate: 9.154E-06 | global batch size: 16 | lm loss: 6.613607E+00 | grad norm: 1.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1747/ 128728 | consumed samples: 27952 | consumed tokens: 57245696 | elapsed time per iteration (s): 15.20 | learning rate: 9.159E-06 | global batch size: 16 | lm loss: 6.643983E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1748/ 128728 | consumed samples: 27968 | consumed tokens: 57278464 | elapsed time per iteration (s): 15.23 | learning rate: 9.165E-06 | global batch size: 16 | lm loss: 6.584989E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1749/ 128728 | consumed samples: 27984 | consumed tokens: 57311232 | elapsed time per iteration (s): 15.19 | learning rate: 9.170E-06 | global batch size: 16 | lm loss: 6.636932E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 1750/ 128728 | consumed samples: 28000 | consumed tokens: 57344000 | elapsed time per iteration (s): 15.24 | learning rate: 9.175E-06 | global batch size: 16 | lm loss: 6.609263E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1751/ 128728 | consumed samples: 28016 | consumed tokens: 57376768 | elapsed time per iteration (s): 15.22 | learning rate: 9.180E-06 | global batch size: 16 | lm loss: 6.592394E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1752/ 128728 | consumed samples: 28032 | consumed tokens: 57409536 | elapsed time per iteration (s): 15.22 | learning rate: 9.186E-06 | global batch size: 16 | lm loss: 6.624197E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1753/ 128728 | consumed samples: 28048 | consumed tokens: 57442304 | elapsed time per iteration (s): 15.21 | learning rate: 9.191E-06 | global batch size: 16 | lm loss: 6.544185E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1754/ 128728 | consumed samples: 28064 | consumed tokens: 57475072 | elapsed time per iteration (s): 15.14 | learning rate: 9.196E-06 | global batch size: 16 | lm loss: 
6.537138E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 1755/ 128728 | consumed samples: 28080 | consumed tokens: 57507840 | elapsed time per iteration (s): 15.23 | learning rate: 9.201E-06 | global batch size: 16 | lm loss: 6.729046E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1756/ 128728 | consumed samples: 28096 | consumed tokens: 57540608 | elapsed time per iteration (s): 15.20 | learning rate: 9.207E-06 | global batch size: 16 | lm loss: 6.539384E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1757/ 128728 | consumed samples: 28112 | consumed tokens: 57573376 | elapsed time per iteration (s): 15.20 | learning rate: 9.212E-06 | global batch size: 16 | lm loss: 6.607846E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1758/ 128728 | consumed samples: 28128 | consumed tokens: 57606144 | elapsed time per iteration (s): 15.24 | learning rate: 9.217E-06 | global batch size: 16 | lm loss: 6.539383E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1759/ 128728 | consumed samples: 28144 | consumed tokens: 57638912 | elapsed time per iteration (s): 15.23 | learning rate: 9.222E-06 | global batch size: 16 | lm loss: 6.513782E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1760/ 128728 | consumed samples: 28160 | consumed tokens: 57671680 | 
elapsed time per iteration (s): 15.24 | learning rate: 9.227E-06 | global batch size: 16 | lm loss: 6.585566E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1761/ 128728 | consumed samples: 28176 | consumed tokens: 57704448 | elapsed time per iteration (s): 15.17 | learning rate: 9.233E-06 | global batch size: 16 | lm loss: 6.562658E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1762/ 128728 | consumed samples: 28192 | consumed tokens: 57737216 | elapsed time per iteration (s): 15.24 | learning rate: 9.238E-06 | global batch size: 16 | lm loss: 6.564455E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1763/ 128728 | consumed samples: 28208 | consumed tokens: 57769984 | elapsed time per iteration (s): 15.27 | learning rate: 9.243E-06 | global batch size: 16 | lm loss: 6.471663E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1764/ 128728 | consumed samples: 28224 | consumed tokens: 57802752 | elapsed time per iteration (s): 15.16 | learning rate: 9.248E-06 | global batch size: 16 | lm loss: 6.601748E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1765/ 128728 | consumed samples: 28240 | consumed tokens: 57835520 | elapsed time per iteration (s): 15.20 | learning rate: 9.254E-06 | global batch size: 16 | lm loss: 6.736907E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 
8.06 | [default7]: iteration 1766/ 128728 | consumed samples: 28256 | consumed tokens: 57868288 | elapsed time per iteration (s): 15.15 | learning rate: 9.259E-06 | global batch size: 16 | lm loss: 6.662552E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 1767/ 128728 | consumed samples: 28272 | consumed tokens: 57901056 | elapsed time per iteration (s): 15.23 | learning rate: 9.264E-06 | global batch size: 16 | lm loss: 6.668775E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1768/ 128728 | consumed samples: 28288 | consumed tokens: 57933824 | elapsed time per iteration (s): 15.19 | learning rate: 9.269E-06 | global batch size: 16 | lm loss: 6.802093E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 1769/ 128728 | consumed samples: 28304 | consumed tokens: 57966592 | elapsed time per iteration (s): 15.25 | learning rate: 9.275E-06 | global batch size: 16 | lm loss: 6.619685E+00 | grad norm: 1.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1770/ 128728 | consumed samples: 28320 | consumed tokens: 57999360 | elapsed time per iteration (s): 15.22 | learning rate: 9.280E-06 | global batch size: 16 | lm loss: 6.863540E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1771/ 128728 | consumed samples: 28336 | consumed tokens: 58032128 | elapsed time per iteration (s): 15.20 | learning rate: 9.285E-06 | global batch size: 16 | lm loss: 6.705997E+00 | grad norm: 0.778 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1772/ 128728 | consumed samples: 28352 | consumed tokens: 58064896 | elapsed time per iteration (s): 15.21 | learning rate: 9.290E-06 | global batch size: 16 | lm loss: 6.656632E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1773/ 128728 | consumed samples: 28368 | consumed tokens: 58097664 | elapsed time per iteration (s): 15.27 | learning rate: 9.296E-06 | global batch size: 16 | lm loss: 6.472975E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1774/ 128728 | consumed samples: 28384 | consumed tokens: 58130432 | elapsed time per iteration (s): 15.25 | learning rate: 9.301E-06 | global batch size: 16 | lm loss: 6.678162E+00 | grad norm: 1.091 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1775/ 128728 | consumed samples: 28400 | consumed tokens: 58163200 | elapsed time per iteration (s): 15.24 | learning rate: 9.306E-06 | global batch size: 16 | lm loss: 6.682146E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1776/ 128728 | consumed samples: 28416 | consumed tokens: 58195968 | elapsed time per iteration (s): 15.25 | learning rate: 9.311E-06 | global batch size: 16 | lm loss: 6.408243E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1777/ 128728 | consumed samples: 28432 | consumed tokens: 58228736 | elapsed time per iteration (s): 15.22 | learning rate: 
9.317E-06 | global batch size: 16 | lm loss: 6.565637E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1778/ 128728 | consumed samples: 28448 | consumed tokens: 58261504 | elapsed time per iteration (s): 15.24 | learning rate: 9.322E-06 | global batch size: 16 | lm loss: 6.499868E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1779/ 128728 | consumed samples: 28464 | consumed tokens: 58294272 | elapsed time per iteration (s): 15.26 | learning rate: 9.327E-06 | global batch size: 16 | lm loss: 6.529296E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1780/ 128728 | consumed samples: 28480 | consumed tokens: 58327040 | elapsed time per iteration (s): 15.26 | learning rate: 9.332E-06 | global batch size: 16 | lm loss: 6.737674E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1781/ 128728 | consumed samples: 28496 | consumed tokens: 58359808 | elapsed time per iteration (s): 15.26 | learning rate: 9.338E-06 | global batch size: 16 | lm loss: 6.714326E+00 | grad norm: 1.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1782/ 128728 | consumed samples: 28512 | consumed tokens: 58392576 | elapsed time per iteration (s): 15.25 | learning rate: 9.343E-06 | global batch size: 16 | lm loss: 6.741374E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1783/ 128728 | consumed 
samples: 28528 | consumed tokens: 58425344 | elapsed time per iteration (s): 15.23 | learning rate: 9.348E-06 | global batch size: 16 | lm loss: 6.716001E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1784/ 128728 | consumed samples: 28544 | consumed tokens: 58458112 | elapsed time per iteration (s): 15.25 | learning rate: 9.353E-06 | global batch size: 16 | lm loss: 6.781655E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1785/ 128728 | consumed samples: 28560 | consumed tokens: 58490880 | elapsed time per iteration (s): 15.24 | learning rate: 9.359E-06 | global batch size: 16 | lm loss: 6.668215E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1786/ 128728 | consumed samples: 28576 | consumed tokens: 58523648 | elapsed time per iteration (s): 15.21 | learning rate: 9.364E-06 | global batch size: 16 | lm loss: 6.672732E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1787/ 128728 | consumed samples: 28592 | consumed tokens: 58556416 | elapsed time per iteration (s): 15.17 | learning rate: 9.369E-06 | global batch size: 16 | lm loss: 6.688550E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1788/ 128728 | consumed samples: 28608 | consumed tokens: 58589184 | elapsed time per iteration (s): 15.27 | learning rate: 9.374E-06 | global batch size: 16 | lm loss: 6.671909E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1789/ 128728 | consumed samples: 28624 | consumed tokens: 58621952 | elapsed time per iteration (s): 15.24 | learning rate: 9.380E-06 | global batch size: 16 | lm loss: 6.553540E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1790/ 128728 | consumed samples: 28640 | consumed tokens: 58654720 | elapsed time per iteration (s): 15.24 | learning rate: 9.385E-06 | global batch size: 16 | lm loss: 6.730831E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1791/ 128728 | consumed samples: 28656 | consumed tokens: 58687488 | elapsed time per iteration (s): 15.26 | learning rate: 9.390E-06 | global batch size: 16 | lm loss: 6.436703E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1792/ 128728 | consumed samples: 28672 | consumed tokens: 58720256 | elapsed time per iteration (s): 15.22 | learning rate: 9.395E-06 | global batch size: 16 | lm loss: 6.470987E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1793/ 128728 | consumed samples: 28688 | consumed tokens: 58753024 | elapsed time per iteration (s): 15.22 | learning rate: 9.401E-06 | global batch size: 16 | lm loss: 6.891861E+00 | grad norm: 1.009 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1794/ 128728 | consumed samples: 28704 | consumed tokens: 58785792 | elapsed time per iteration (s): 15.25 | learning rate: 9.406E-06 | global batch size: 16 | lm loss: 
6.579654E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1795/ 128728 | consumed samples: 28720 | consumed tokens: 58818560 | elapsed time per iteration (s): 15.20 | learning rate: 9.411E-06 | global batch size: 16 | lm loss: 6.601382E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1796/ 128728 | consumed samples: 28736 | consumed tokens: 58851328 | elapsed time per iteration (s): 15.22 | learning rate: 9.416E-06 | global batch size: 16 | lm loss: 6.712410E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1797/ 128728 | consumed samples: 28752 | consumed tokens: 58884096 | elapsed time per iteration (s): 15.28 | learning rate: 9.421E-06 | global batch size: 16 | lm loss: 6.652021E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1798/ 128728 | consumed samples: 28768 | consumed tokens: 58916864 | elapsed time per iteration (s): 15.21 | learning rate: 9.427E-06 | global batch size: 16 | lm loss: 6.661202E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1799/ 128728 | consumed samples: 28784 | consumed tokens: 58949632 | elapsed time per iteration (s): 15.21 | learning rate: 9.432E-06 | global batch size: 16 | lm loss: 6.523858E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1800/ 128728 | consumed samples: 28800 | consumed tokens: 58982400 | 
elapsed time per iteration (s): 15.24 | learning rate: 9.437E-06 | global batch size: 16 | lm loss: 6.623683E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1801/ 128728 | consumed samples: 28816 | consumed tokens: 59015168 | elapsed time per iteration (s): 15.23 | learning rate: 9.442E-06 | global batch size: 16 | lm loss: 6.714439E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1802/ 128728 | consumed samples: 28832 | consumed tokens: 59047936 | elapsed time per iteration (s): 15.23 | learning rate: 9.448E-06 | global batch size: 16 | lm loss: 6.545686E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1803/ 128728 | consumed samples: 28848 | consumed tokens: 59080704 | elapsed time per iteration (s): 15.24 | learning rate: 9.453E-06 | global batch size: 16 | lm loss: 6.519464E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1804/ 128728 | consumed samples: 28864 | consumed tokens: 59113472 | elapsed time per iteration (s): 15.22 | learning rate: 9.458E-06 | global batch size: 16 | lm loss: 6.891013E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1805/ 128728 | consumed samples: 28880 | consumed tokens: 59146240 | elapsed time per iteration (s): 15.21 | learning rate: 9.463E-06 | global batch size: 16 | lm loss: 6.718174E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.05 | [default7]: iteration 1806/ 128728 | consumed samples: 28896 | consumed tokens: 59179008 | elapsed time per iteration (s): 15.24 | learning rate: 9.469E-06 | global batch size: 16 | lm loss: 6.641480E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1807/ 128728 | consumed samples: 28912 | consumed tokens: 59211776 | elapsed time per iteration (s): 15.23 | learning rate: 9.474E-06 | global batch size: 16 | lm loss: 6.519784E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1808/ 128728 | consumed samples: 28928 | consumed tokens: 59244544 | elapsed time per iteration (s): 15.21 | learning rate: 9.479E-06 | global batch size: 16 | lm loss: 6.584937E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1809/ 128728 | consumed samples: 28944 | consumed tokens: 59277312 | elapsed time per iteration (s): 15.22 | learning rate: 9.484E-06 | global batch size: 16 | lm loss: 6.330964E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1810/ 128728 | consumed samples: 28960 | consumed tokens: 59310080 | elapsed time per iteration (s): 15.23 | learning rate: 9.490E-06 | global batch size: 16 | lm loss: 7.042406E+00 | grad norm: 0.997 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1811/ 128728 | consumed samples: 28976 | consumed tokens: 59342848 | elapsed time per iteration (s): 15.23 | learning rate: 9.495E-06 | global batch size: 16 | lm loss: 6.472970E+00 | grad norm: 0.729 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1812/ 128728 | consumed samples: 28992 | consumed tokens: 59375616 | elapsed time per iteration (s): 15.23 | learning rate: 9.500E-06 | global batch size: 16 | lm loss: 6.761879E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1813/ 128728 | consumed samples: 29008 | consumed tokens: 59408384 | elapsed time per iteration (s): 15.24 | learning rate: 9.505E-06 | global batch size: 16 | lm loss: 6.489796E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1814/ 128728 | consumed samples: 29024 | consumed tokens: 59441152 | elapsed time per iteration (s): 15.24 | learning rate: 9.511E-06 | global batch size: 16 | lm loss: 6.517369E+00 | grad norm: 1.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1815/ 128728 | consumed samples: 29040 | consumed tokens: 59473920 | elapsed time per iteration (s): 15.24 | learning rate: 9.516E-06 | global batch size: 16 | lm loss: 6.735540E+00 | grad norm: 1.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1816/ 128728 | consumed samples: 29056 | consumed tokens: 59506688 | elapsed time per iteration (s): 15.22 | learning rate: 9.521E-06 | global batch size: 16 | lm loss: 6.628697E+00 | grad norm: 1.480 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1817/ 128728 | consumed samples: 29072 | consumed tokens: 59539456 | elapsed time per iteration (s): 15.22 | learning rate: 
9.526E-06 | global batch size: 16 | lm loss: 6.515108E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1818/ 128728 | consumed samples: 29088 | consumed tokens: 59572224 | elapsed time per iteration (s): 15.24 | learning rate: 9.532E-06 | global batch size: 16 | lm loss: 6.639629E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1819/ 128728 | consumed samples: 29104 | consumed tokens: 59604992 | elapsed time per iteration (s): 15.22 | learning rate: 9.537E-06 | global batch size: 16 | lm loss: 6.651646E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1820/ 128728 | consumed samples: 29120 | consumed tokens: 59637760 | elapsed time per iteration (s): 15.22 | learning rate: 9.542E-06 | global batch size: 16 | lm loss: 6.575983E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1821/ 128728 | consumed samples: 29136 | consumed tokens: 59670528 | elapsed time per iteration (s): 15.15 | learning rate: 9.547E-06 | global batch size: 16 | lm loss: 6.677689E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 1822/ 128728 | consumed samples: 29152 | consumed tokens: 59703296 | elapsed time per iteration (s): 15.17 | learning rate: 9.553E-06 | global batch size: 16 | lm loss: 6.558556E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1823/ 128728 | consumed 
samples: 29168 | consumed tokens: 59736064 | elapsed time per iteration (s): 15.22 | learning rate: 9.558E-06 | global batch size: 16 | lm loss: 6.579345E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1824/ 128728 | consumed samples: 29184 | consumed tokens: 59768832 | elapsed time per iteration (s): 15.18 | learning rate: 9.563E-06 | global batch size: 16 | lm loss: 6.645849E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1825/ 128728 | consumed samples: 29200 | consumed tokens: 59801600 | elapsed time per iteration (s): 15.23 | learning rate: 9.568E-06 | global batch size: 16 | lm loss: 6.550450E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1826/ 128728 | consumed samples: 29216 | consumed tokens: 59834368 | elapsed time per iteration (s): 15.25 | learning rate: 9.574E-06 | global batch size: 16 | lm loss: 6.690180E+00 | grad norm: 0.949 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 1827/ 128728 | consumed samples: 29232 | consumed tokens: 59867136 | elapsed time per iteration (s): 15.26 | learning rate: 9.579E-06 | global batch size: 16 | lm loss: 6.688923E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1828/ 128728 | consumed samples: 29248 | consumed tokens: 59899904 | elapsed time per iteration (s): 15.25 | learning rate: 9.584E-06 | global batch size: 16 | lm loss: 6.797194E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1829/ 128728 | consumed samples: 29264 | consumed tokens: 59932672 | elapsed time per iteration (s): 15.28 | learning rate: 9.589E-06 | global batch size: 16 | lm loss: 6.436186E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1830/ 128728 | consumed samples: 29280 | consumed tokens: 59965440 | elapsed time per iteration (s): 15.23 | learning rate: 9.594E-06 | global batch size: 16 | lm loss: 6.853899E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1831/ 128728 | consumed samples: 29296 | consumed tokens: 59998208 | elapsed time per iteration (s): 15.24 | learning rate: 9.600E-06 | global batch size: 16 | lm loss: 6.458448E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1832/ 128728 | consumed samples: 29312 | consumed tokens: 60030976 | elapsed time per iteration (s): 15.24 | learning rate: 9.605E-06 | global batch size: 16 | lm loss: 6.671127E+00 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1833/ 128728 | consumed samples: 29328 | consumed tokens: 60063744 | elapsed time per iteration (s): 15.24 | learning rate: 9.610E-06 | global batch size: 16 | lm loss: 6.545115E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1834/ 128728 | consumed samples: 29344 | consumed tokens: 60096512 | elapsed time per iteration (s): 15.26 | learning rate: 9.615E-06 | global batch size: 16 | lm loss: 
6.780546E+00 | grad norm: 1.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1835/ 128728 | consumed samples: 29360 | consumed tokens: 60129280 | elapsed time per iteration (s): 15.21 | learning rate: 9.621E-06 | global batch size: 16 | lm loss: 6.472826E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1836/ 128728 | consumed samples: 29376 | consumed tokens: 60162048 | elapsed time per iteration (s): 15.23 | learning rate: 9.626E-06 | global batch size: 16 | lm loss: 6.723257E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1837/ 128728 | consumed samples: 29392 | consumed tokens: 60194816 | elapsed time per iteration (s): 15.24 | learning rate: 9.631E-06 | global batch size: 16 | lm loss: 6.483226E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1838/ 128728 | consumed samples: 29408 | consumed tokens: 60227584 | elapsed time per iteration (s): 15.23 | learning rate: 9.636E-06 | global batch size: 16 | lm loss: 6.481052E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1839/ 128728 | consumed samples: 29424 | consumed tokens: 60260352 | elapsed time per iteration (s): 15.21 | learning rate: 9.642E-06 | global batch size: 16 | lm loss: 6.451497E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1840/ 128728 | consumed samples: 29440 | consumed tokens: 60293120 | 
elapsed time per iteration (s): 15.21 | learning rate: 9.647E-06 | global batch size: 16 | lm loss: 6.728784E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1841/ 128728 | consumed samples: 29456 | consumed tokens: 60325888 | elapsed time per iteration (s): 15.23 | learning rate: 9.652E-06 | global batch size: 16 | lm loss: 6.508964E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1842/ 128728 | consumed samples: 29472 | consumed tokens: 60358656 | elapsed time per iteration (s): 15.22 | learning rate: 9.657E-06 | global batch size: 16 | lm loss: 6.681833E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1843/ 128728 | consumed samples: 29488 | consumed tokens: 60391424 | elapsed time per iteration (s): 15.27 | learning rate: 9.663E-06 | global batch size: 16 | lm loss: 6.648950E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1844/ 128728 | consumed samples: 29504 | consumed tokens: 60424192 | elapsed time per iteration (s): 15.21 | learning rate: 9.668E-06 | global batch size: 16 | lm loss: 6.666204E+00 | grad norm: 1.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1845/ 128728 | consumed samples: 29520 | consumed tokens: 60456960 | elapsed time per iteration (s): 15.22 | learning rate: 9.673E-06 | global batch size: 16 | lm loss: 6.498180E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1846/ 128728 | consumed samples: 29536 | consumed tokens: 60489728 | elapsed time per iteration (s): 15.22 | learning rate: 9.678E-06 | global batch size: 16 | lm loss: 6.420746E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1847/ 128728 | consumed samples: 29552 | consumed tokens: 60522496 | elapsed time per iteration (s): 15.33 | learning rate: 9.684E-06 | global batch size: 16 | lm loss: 6.513135E+00 | grad norm: 1.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.043 | TFLOPs: 7.99 | [default7]: iteration 1848/ 128728 | consumed samples: 29568 | consumed tokens: 60555264 | elapsed time per iteration (s): 15.27 | learning rate: 9.689E-06 | global batch size: 16 | lm loss: 6.598331E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1849/ 128728 | consumed samples: 29584 | consumed tokens: 60588032 | elapsed time per iteration (s): 15.23 | learning rate: 9.694E-06 | global batch size: 16 | lm loss: 6.658598E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1850/ 128728 | consumed samples: 29600 | consumed tokens: 60620800 | elapsed time per iteration (s): 15.22 | learning rate: 9.699E-06 | global batch size: 16 | lm loss: 6.735951E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1851/ 128728 | consumed samples: 29616 | consumed tokens: 60653568 | elapsed time per iteration (s): 15.21 | learning rate: 9.705E-06 | global batch size: 16 | lm loss: 6.589662E+00 | grad norm: 0.797 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1852/ 128728 | consumed samples: 29632 | consumed tokens: 60686336 | elapsed time per iteration (s): 15.27 | learning rate: 9.710E-06 | global batch size: 16 | lm loss: 6.598696E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1853/ 128728 | consumed samples: 29648 | consumed tokens: 60719104 | elapsed time per iteration (s): 15.27 | learning rate: 9.715E-06 | global batch size: 16 | lm loss: 6.593414E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1854/ 128728 | consumed samples: 29664 | consumed tokens: 60751872 | elapsed time per iteration (s): 15.25 | learning rate: 9.720E-06 | global batch size: 16 | lm loss: 6.430328E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1855/ 128728 | consumed samples: 29680 | consumed tokens: 60784640 | elapsed time per iteration (s): 15.22 | learning rate: 9.726E-06 | global batch size: 16 | lm loss: 6.661034E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1856/ 128728 | consumed samples: 29696 | consumed tokens: 60817408 | elapsed time per iteration (s): 15.26 | learning rate: 9.731E-06 | global batch size: 16 | lm loss: 6.709377E+00 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1857/ 128728 | consumed samples: 29712 | consumed tokens: 60850176 | elapsed time per iteration (s): 15.21 | learning rate: 
9.736E-06 | global batch size: 16 | lm loss: 6.679298E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1858/ 128728 | consumed samples: 29728 | consumed tokens: 60882944 | elapsed time per iteration (s): 15.25 | learning rate: 9.741E-06 | global batch size: 16 | lm loss: 6.827992E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1859/ 128728 | consumed samples: 29744 | consumed tokens: 60915712 | elapsed time per iteration (s): 15.24 | learning rate: 9.747E-06 | global batch size: 16 | lm loss: 6.568586E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1860/ 128728 | consumed samples: 29760 | consumed tokens: 60948480 | elapsed time per iteration (s): 15.21 | learning rate: 9.752E-06 | global batch size: 16 | lm loss: 6.410093E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1861/ 128728 | consumed samples: 29776 | consumed tokens: 60981248 | elapsed time per iteration (s): 15.22 | learning rate: 9.757E-06 | global batch size: 16 | lm loss: 6.323568E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1862/ 128728 | consumed samples: 29792 | consumed tokens: 61014016 | elapsed time per iteration (s): 15.21 | learning rate: 9.762E-06 | global batch size: 16 | lm loss: 6.819780E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1863/ 128728 | consumed 
samples: 29808 | consumed tokens: 61046784 | elapsed time per iteration (s): 15.24 | learning rate: 9.768E-06 | global batch size: 16 | lm loss: 6.857122E+00 | grad norm: 1.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1864/ 128728 | consumed samples: 29824 | consumed tokens: 61079552 | elapsed time per iteration (s): 15.27 | learning rate: 9.773E-06 | global batch size: 16 | lm loss: 6.621314E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1865/ 128728 | consumed samples: 29840 | consumed tokens: 61112320 | elapsed time per iteration (s): 15.20 | learning rate: 9.778E-06 | global batch size: 16 | lm loss: 6.558571E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1866/ 128728 | consumed samples: 29856 | consumed tokens: 61145088 | elapsed time per iteration (s): 15.21 | learning rate: 9.783E-06 | global batch size: 16 | lm loss: 6.498933E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1867/ 128728 | consumed samples: 29872 | consumed tokens: 61177856 | elapsed time per iteration (s): 15.22 | learning rate: 9.788E-06 | global batch size: 16 | lm loss: 6.822206E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1868/ 128728 | consumed samples: 29888 | consumed tokens: 61210624 | elapsed time per iteration (s): 15.18 | learning rate: 9.794E-06 | global batch size: 16 | lm loss: 6.600270E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1869/ 128728 | consumed samples: 29904 | consumed tokens: 61243392 | elapsed time per iteration (s): 15.22 | learning rate: 9.799E-06 | global batch size: 16 | lm loss: 6.469594E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1870/ 128728 | consumed samples: 29920 | consumed tokens: 61276160 | elapsed time per iteration (s): 15.25 | learning rate: 9.804E-06 | global batch size: 16 | lm loss: 6.446286E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1871/ 128728 | consumed samples: 29936 | consumed tokens: 61308928 | elapsed time per iteration (s): 15.23 | learning rate: 9.809E-06 | global batch size: 16 | lm loss: 6.491003E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1872/ 128728 | consumed samples: 29952 | consumed tokens: 61341696 | elapsed time per iteration (s): 15.23 | learning rate: 9.815E-06 | global batch size: 16 | lm loss: 6.493572E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1873/ 128728 | consumed samples: 29968 | consumed tokens: 61374464 | elapsed time per iteration (s): 15.24 | learning rate: 9.820E-06 | global batch size: 16 | lm loss: 6.607419E+00 | grad norm: 1.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1874/ 128728 | consumed samples: 29984 | consumed tokens: 61407232 | elapsed time per iteration (s): 15.20 | learning rate: 9.825E-06 | global batch size: 16 | lm loss: 
6.643625E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1875/ 128728 | consumed samples: 30000 | consumed tokens: 61440000 | elapsed time per iteration (s): 15.21 | learning rate: 9.830E-06 | global batch size: 16 | lm loss: 6.527872E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1876/ 128728 | consumed samples: 30016 | consumed tokens: 61472768 | elapsed time per iteration (s): 15.23 | learning rate: 9.836E-06 | global batch size: 16 | lm loss: 6.579536E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1877/ 128728 | consumed samples: 30032 | consumed tokens: 61505536 | elapsed time per iteration (s): 15.27 | learning rate: 9.841E-06 | global batch size: 16 | lm loss: 6.619586E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1878/ 128728 | consumed samples: 30048 | consumed tokens: 61538304 | elapsed time per iteration (s): 15.19 | learning rate: 9.846E-06 | global batch size: 16 | lm loss: 6.514913E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1879/ 128728 | consumed samples: 30064 | consumed tokens: 61571072 | elapsed time per iteration (s): 15.24 | learning rate: 9.851E-06 | global batch size: 16 | lm loss: 6.534479E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1880/ 128728 | consumed samples: 30080 | consumed tokens: 61603840 | 
elapsed time per iteration (s): 15.30 | learning rate: 9.857E-06 | global batch size: 16 | lm loss: 6.383130E+00 | grad norm: 1.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 1881/ 128728 | consumed samples: 30096 | consumed tokens: 61636608 | elapsed time per iteration (s): 15.27 | learning rate: 9.862E-06 | global batch size: 16 | lm loss: 6.530272E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1882/ 128728 | consumed samples: 30112 | consumed tokens: 61669376 | elapsed time per iteration (s): 15.23 | learning rate: 9.867E-06 | global batch size: 16 | lm loss: 6.505867E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1883/ 128728 | consumed samples: 30128 | consumed tokens: 61702144 | elapsed time per iteration (s): 15.23 | learning rate: 9.872E-06 | global batch size: 16 | lm loss: 6.482748E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1884/ 128728 | consumed samples: 30144 | consumed tokens: 61734912 | elapsed time per iteration (s): 15.26 | learning rate: 9.878E-06 | global batch size: 16 | lm loss: 6.664700E+00 | grad norm: 2.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1885/ 128728 | consumed samples: 30160 | consumed tokens: 61767680 | elapsed time per iteration (s): 15.22 | learning rate: 9.883E-06 | global batch size: 16 | lm loss: 6.515076E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1886/ 128728 | consumed samples: 30176 | consumed tokens: 61800448 | elapsed time per iteration (s): 15.27 | learning rate: 9.888E-06 | global batch size: 16 | lm loss: 6.282681E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 1887/ 128728 | consumed samples: 30192 | consumed tokens: 61833216 | elapsed time per iteration (s): 15.23 | learning rate: 9.893E-06 | global batch size: 16 | lm loss: 6.580127E+00 | grad norm: 1.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1888/ 128728 | consumed samples: 30208 | consumed tokens: 61865984 | elapsed time per iteration (s): 15.22 | learning rate: 9.899E-06 | global batch size: 16 | lm loss: 6.476642E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1889/ 128728 | consumed samples: 30224 | consumed tokens: 61898752 | elapsed time per iteration (s): 15.23 | learning rate: 9.904E-06 | global batch size: 16 | lm loss: 6.487213E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1890/ 128728 | consumed samples: 30240 | consumed tokens: 61931520 | elapsed time per iteration (s): 15.19 | learning rate: 9.909E-06 | global batch size: 16 | lm loss: 6.532672E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1891/ 128728 | consumed samples: 30256 | consumed tokens: 61964288 | elapsed time per iteration (s): 15.23 | learning rate: 9.914E-06 | global batch size: 16 | lm loss: 6.400381E+00 | grad norm: 0.806 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1892/ 128728 | consumed samples: 30272 | consumed tokens: 61997056 | elapsed time per iteration (s): 15.25 | learning rate: 9.920E-06 | global batch size: 16 | lm loss: 6.453693E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1893/ 128728 | consumed samples: 30288 | consumed tokens: 62029824 | elapsed time per iteration (s): 15.22 | learning rate: 9.925E-06 | global batch size: 16 | lm loss: 6.528496E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1894/ 128728 | consumed samples: 30304 | consumed tokens: 62062592 | elapsed time per iteration (s): 15.26 | learning rate: 9.930E-06 | global batch size: 16 | lm loss: 6.691092E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1895/ 128728 | consumed samples: 30320 | consumed tokens: 62095360 | elapsed time per iteration (s): 15.23 | learning rate: 9.935E-06 | global batch size: 16 | lm loss: 6.338684E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1896/ 128728 | consumed samples: 30336 | consumed tokens: 62128128 | elapsed time per iteration (s): 15.28 | learning rate: 9.941E-06 | global batch size: 16 | lm loss: 6.594782E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1897/ 128728 | consumed samples: 30352 | consumed tokens: 62160896 | elapsed time per iteration (s): 15.20 | learning rate: 
9.946E-06 | global batch size: 16 | lm loss: 6.504727E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1898/ 128728 | consumed samples: 30368 | consumed tokens: 62193664 | elapsed time per iteration (s): 15.16 | learning rate: 9.951E-06 | global batch size: 16 | lm loss: 6.835838E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1899/ 128728 | consumed samples: 30384 | consumed tokens: 62226432 | elapsed time per iteration (s): 15.25 | learning rate: 9.956E-06 | global batch size: 16 | lm loss: 6.691212E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1900/ 128728 | consumed samples: 30400 | consumed tokens: 62259200 | elapsed time per iteration (s): 15.21 | learning rate: 9.961E-06 | global batch size: 16 | lm loss: 6.594204E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1901/ 128728 | consumed samples: 30416 | consumed tokens: 62291968 | elapsed time per iteration (s): 15.24 | learning rate: 9.967E-06 | global batch size: 16 | lm loss: 6.573639E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1902/ 128728 | consumed samples: 30432 | consumed tokens: 62324736 | elapsed time per iteration (s): 15.20 | learning rate: 9.972E-06 | global batch size: 16 | lm loss: 6.642185E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1903/ 128728 | consumed 
samples: 30448 | consumed tokens: 62357504 | elapsed time per iteration (s): 15.20 | learning rate: 9.977E-06 | global batch size: 16 | lm loss: 6.638869E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1904/ 128728 | consumed samples: 30464 | consumed tokens: 62390272 | elapsed time per iteration (s): 15.18 | learning rate: 9.982E-06 | global batch size: 16 | lm loss: 6.439603E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1905/ 128728 | consumed samples: 30480 | consumed tokens: 62423040 | elapsed time per iteration (s): 15.25 | learning rate: 9.988E-06 | global batch size: 16 | lm loss: 6.637027E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1906/ 128728 | consumed samples: 30496 | consumed tokens: 62455808 | elapsed time per iteration (s): 15.15 | learning rate: 9.993E-06 | global batch size: 16 | lm loss: 6.455775E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 1907/ 128728 | consumed samples: 30512 | consumed tokens: 62488576 | elapsed time per iteration (s): 15.17 | learning rate: 9.998E-06 | global batch size: 16 | lm loss: 6.424469E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1908/ 128728 | consumed samples: 30528 | consumed tokens: 62521344 | elapsed time per iteration (s): 15.20 | learning rate: 1.000E-05 | global batch size: 16 | lm loss: 6.547606E+00 | grad norm: 1.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1909/ 128728 | consumed samples: 30544 | consumed tokens: 62554112 | elapsed time per iteration (s): 15.22 | learning rate: 1.001E-05 | global batch size: 16 | lm loss: 6.466846E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1910/ 128728 | consumed samples: 30560 | consumed tokens: 62586880 | elapsed time per iteration (s): 15.24 | learning rate: 1.001E-05 | global batch size: 16 | lm loss: 6.650313E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1911/ 128728 | consumed samples: 30576 | consumed tokens: 62619648 | elapsed time per iteration (s): 15.22 | learning rate: 1.002E-05 | global batch size: 16 | lm loss: 6.618893E+00 | grad norm: 1.340 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1912/ 128728 | consumed samples: 30592 | consumed tokens: 62652416 | elapsed time per iteration (s): 15.21 | learning rate: 1.002E-05 | global batch size: 16 | lm loss: 6.551538E+00 | grad norm: 1.040 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1913/ 128728 | consumed samples: 30608 | consumed tokens: 62685184 | elapsed time per iteration (s): 15.26 | learning rate: 1.003E-05 | global batch size: 16 | lm loss: 6.546391E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1914/ 128728 | consumed samples: 30624 | consumed tokens: 62717952 | elapsed time per iteration (s): 15.25 | learning rate: 1.003E-05 | global batch size: 16 | lm loss: 
6.704463E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1915/ 128728 | consumed samples: 30640 | consumed tokens: 62750720 | elapsed time per iteration (s): 15.22 | learning rate: 1.004E-05 | global batch size: 16 | lm loss: 6.473845E+00 | grad norm: 1.044 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1916/ 128728 | consumed samples: 30656 | consumed tokens: 62783488 | elapsed time per iteration (s): 15.23 | learning rate: 1.005E-05 | global batch size: 16 | lm loss: 6.693832E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1917/ 128728 | consumed samples: 30672 | consumed tokens: 62816256 | elapsed time per iteration (s): 15.21 | learning rate: 1.005E-05 | global batch size: 16 | lm loss: 6.588843E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1918/ 128728 | consumed samples: 30688 | consumed tokens: 62849024 | elapsed time per iteration (s): 15.21 | learning rate: 1.006E-05 | global batch size: 16 | lm loss: 6.421237E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1919/ 128728 | consumed samples: 30704 | consumed tokens: 62881792 | elapsed time per iteration (s): 15.21 | learning rate: 1.006E-05 | global batch size: 16 | lm loss: 6.483512E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1920/ 128728 | consumed samples: 30720 | consumed tokens: 62914560 | 
elapsed time per iteration (s): 15.19 | learning rate: 1.007E-05 | global batch size: 16 | lm loss: 6.566906E+00 | grad norm: 1.538 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1921/ 128728 | consumed samples: 30736 | consumed tokens: 62947328 | elapsed time per iteration (s): 15.21 | learning rate: 1.007E-05 | global batch size: 16 | lm loss: 6.512776E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1922/ 128728 | consumed samples: 30752 | consumed tokens: 62980096 | elapsed time per iteration (s): 15.22 | learning rate: 1.008E-05 | global batch size: 16 | lm loss: 6.370068E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1923/ 128728 | consumed samples: 30768 | consumed tokens: 63012864 | elapsed time per iteration (s): 15.21 | learning rate: 1.008E-05 | global batch size: 16 | lm loss: 6.588835E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1924/ 128728 | consumed samples: 30784 | consumed tokens: 63045632 | elapsed time per iteration (s): 15.21 | learning rate: 1.009E-05 | global batch size: 16 | lm loss: 6.359224E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1925/ 128728 | consumed samples: 30800 | consumed tokens: 63078400 | elapsed time per iteration (s): 15.23 | learning rate: 1.009E-05 | global batch size: 16 | lm loss: 6.767123E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 1926/ 128728 | consumed samples: 30816 | consumed tokens: 63111168 | elapsed time per iteration (s): 15.22 | learning rate: 1.010E-05 | global batch size: 16 | lm loss: 6.477892E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1927/ 128728 | consumed samples: 30832 | consumed tokens: 63143936 | elapsed time per iteration (s): 15.15 | learning rate: 1.010E-05 | global batch size: 16 | lm loss: 6.328223E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 1928/ 128728 | consumed samples: 30848 | consumed tokens: 63176704 | elapsed time per iteration (s): 15.22 | learning rate: 1.011E-05 | global batch size: 16 | lm loss: 6.486270E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1929/ 128728 | consumed samples: 30864 | consumed tokens: 63209472 | elapsed time per iteration (s): 15.21 | learning rate: 1.011E-05 | global batch size: 16 | lm loss: 6.472905E+00 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1930/ 128728 | consumed samples: 30880 | consumed tokens: 63242240 | elapsed time per iteration (s): 15.22 | learning rate: 1.012E-05 | global batch size: 16 | lm loss: 6.811383E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1931/ 128728 | consumed samples: 30896 | consumed tokens: 63275008 | elapsed time per iteration (s): 15.25 | learning rate: 1.012E-05 | global batch size: 16 | lm loss: 6.692072E+00 | grad norm: 0.941 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1932/ 128728 | consumed samples: 30912 | consumed tokens: 63307776 | elapsed time per iteration (s): 15.19 | learning rate: 1.013E-05 | global batch size: 16 | lm loss: 6.495762E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1933/ 128728 | consumed samples: 30928 | consumed tokens: 63340544 | elapsed time per iteration (s): 15.21 | learning rate: 1.013E-05 | global batch size: 16 | lm loss: 6.484449E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1934/ 128728 | consumed samples: 30944 | consumed tokens: 63373312 | elapsed time per iteration (s): 15.20 | learning rate: 1.014E-05 | global batch size: 16 | lm loss: 6.561663E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1935/ 128728 | consumed samples: 30960 | consumed tokens: 63406080 | elapsed time per iteration (s): 15.22 | learning rate: 1.014E-05 | global batch size: 16 | lm loss: 6.621759E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1936/ 128728 | consumed samples: 30976 | consumed tokens: 63438848 | elapsed time per iteration (s): 15.22 | learning rate: 1.015E-05 | global batch size: 16 | lm loss: 6.439867E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1937/ 128728 | consumed samples: 30992 | consumed tokens: 63471616 | elapsed time per iteration (s): 15.26 | learning rate: 
1.016E-05 | global batch size: 16 | lm loss: 6.363036E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1938/ 128728 | consumed samples: 31008 | consumed tokens: 63504384 | elapsed time per iteration (s): 15.26 | learning rate: 1.016E-05 | global batch size: 16 | lm loss: 6.514183E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1939/ 128728 | consumed samples: 31024 | consumed tokens: 63537152 | elapsed time per iteration (s): 15.26 | learning rate: 1.017E-05 | global batch size: 16 | lm loss: 6.339239E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1940/ 128728 | consumed samples: 31040 | consumed tokens: 63569920 | elapsed time per iteration (s): 15.22 | learning rate: 1.017E-05 | global batch size: 16 | lm loss: 6.654146E+00 | grad norm: 0.980 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1941/ 128728 | consumed samples: 31056 | consumed tokens: 63602688 | elapsed time per iteration (s): 15.23 | learning rate: 1.018E-05 | global batch size: 16 | lm loss: 6.603597E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1942/ 128728 | consumed samples: 31072 | consumed tokens: 63635456 | elapsed time per iteration (s): 15.23 | learning rate: 1.018E-05 | global batch size: 16 | lm loss: 6.599665E+00 | grad norm: 3.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1943/ 128728 | consumed 
samples: 31088 | consumed tokens: 63668224 | elapsed time per iteration (s): 15.21 | learning rate: 1.019E-05 | global batch size: 16 | lm loss: 6.663511E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1944/ 128728 | consumed samples: 31104 | consumed tokens: 63700992 | elapsed time per iteration (s): 15.21 | learning rate: 1.019E-05 | global batch size: 16 | lm loss: 6.307026E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1945/ 128728 | consumed samples: 31120 | consumed tokens: 63733760 | elapsed time per iteration (s): 15.22 | learning rate: 1.020E-05 | global batch size: 16 | lm loss: 6.489582E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1946/ 128728 | consumed samples: 31136 | consumed tokens: 63766528 | elapsed time per iteration (s): 15.26 | learning rate: 1.020E-05 | global batch size: 16 | lm loss: 6.788570E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 1947/ 128728 | consumed samples: 31152 | consumed tokens: 63799296 | elapsed time per iteration (s): 15.24 | learning rate: 1.021E-05 | global batch size: 16 | lm loss: 6.571981E+00 | grad norm: 1.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1948/ 128728 | consumed samples: 31168 | consumed tokens: 63832064 | elapsed time per iteration (s): 15.21 | learning rate: 1.021E-05 | global batch size: 16 | lm loss: 6.630430E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1949/ 128728 | consumed samples: 31184 | consumed tokens: 63864832 | elapsed time per iteration (s): 15.21 | learning rate: 1.022E-05 | global batch size: 16 | lm loss: 6.470918E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1950/ 128728 | consumed samples: 31200 | consumed tokens: 63897600 | elapsed time per iteration (s): 15.28 | learning rate: 1.022E-05 | global batch size: 16 | lm loss: 6.354256E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 1951/ 128728 | consumed samples: 31216 | consumed tokens: 63930368 | elapsed time per iteration (s): 15.16 | learning rate: 1.023E-05 | global batch size: 16 | lm loss: 6.493493E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 1952/ 128728 | consumed samples: 31232 | consumed tokens: 63963136 | elapsed time per iteration (s): 15.20 | learning rate: 1.023E-05 | global batch size: 16 | lm loss: 6.460168E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1953/ 128728 | consumed samples: 31248 | consumed tokens: 63995904 | elapsed time per iteration (s): 15.22 | learning rate: 1.024E-05 | global batch size: 16 | lm loss: 6.540512E+00 | grad norm: 1.009 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1954/ 128728 | consumed samples: 31264 | consumed tokens: 64028672 | elapsed time per iteration (s): 15.21 | learning rate: 1.024E-05 | global batch size: 16 | lm loss: 
6.298806E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1955/ 128728 | consumed samples: 31280 | consumed tokens: 64061440 | elapsed time per iteration (s): 15.23 | learning rate: 1.025E-05 | global batch size: 16 | lm loss: 6.592202E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1956/ 128728 | consumed samples: 31296 | consumed tokens: 64094208 | elapsed time per iteration (s): 15.22 | learning rate: 1.026E-05 | global batch size: 16 | lm loss: 6.384544E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1957/ 128728 | consumed samples: 31312 | consumed tokens: 64126976 | elapsed time per iteration (s): 15.23 | learning rate: 1.026E-05 | global batch size: 16 | lm loss: 6.758242E+00 | grad norm: 1.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1958/ 128728 | consumed samples: 31328 | consumed tokens: 64159744 | elapsed time per iteration (s): 15.24 | learning rate: 1.027E-05 | global batch size: 16 | lm loss: 6.602652E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1959/ 128728 | consumed samples: 31344 | consumed tokens: 64192512 | elapsed time per iteration (s): 15.20 | learning rate: 1.027E-05 | global batch size: 16 | lm loss: 6.728225E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1960/ 128728 | consumed samples: 31360 | consumed tokens: 64225280 | 
elapsed time per iteration (s): 15.20 | learning rate: 1.028E-05 | global batch size: 16 | lm loss: 6.458584E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1961/ 128728 | consumed samples: 31376 | consumed tokens: 64258048 | elapsed time per iteration (s): 15.21 | learning rate: 1.028E-05 | global batch size: 16 | lm loss: 6.611272E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1962/ 128728 | consumed samples: 31392 | consumed tokens: 64290816 | elapsed time per iteration (s): 15.18 | learning rate: 1.029E-05 | global batch size: 16 | lm loss: 6.663339E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1963/ 128728 | consumed samples: 31408 | consumed tokens: 64323584 | elapsed time per iteration (s): 15.24 | learning rate: 1.029E-05 | global batch size: 16 | lm loss: 6.305027E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1964/ 128728 | consumed samples: 31424 | consumed tokens: 64356352 | elapsed time per iteration (s): 15.24 | learning rate: 1.030E-05 | global batch size: 16 | lm loss: 6.693589E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1965/ 128728 | consumed samples: 31440 | consumed tokens: 64389120 | elapsed time per iteration (s): 15.26 | learning rate: 1.030E-05 | global batch size: 16 | lm loss: 6.589158E+00 | grad norm: 1.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 
8.03 | [default7]: iteration 1966/ 128728 | consumed samples: 31456 | consumed tokens: 64421888 | elapsed time per iteration (s): 15.20 | learning rate: 1.031E-05 | global batch size: 16 | lm loss: 6.519398E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1967/ 128728 | consumed samples: 31472 | consumed tokens: 64454656 | elapsed time per iteration (s): 15.23 | learning rate: 1.031E-05 | global batch size: 16 | lm loss: 6.615813E+00 | grad norm: 1.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1968/ 128728 | consumed samples: 31488 | consumed tokens: 64487424 | elapsed time per iteration (s): 15.24 | learning rate: 1.032E-05 | global batch size: 16 | lm loss: 6.581736E+00 | grad norm: 1.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1969/ 128728 | consumed samples: 31504 | consumed tokens: 64520192 | elapsed time per iteration (s): 15.24 | learning rate: 1.032E-05 | global batch size: 16 | lm loss: 6.641015E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1970/ 128728 | consumed samples: 31520 | consumed tokens: 64552960 | elapsed time per iteration (s): 15.25 | learning rate: 1.033E-05 | global batch size: 16 | lm loss: 6.500915E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1971/ 128728 | consumed samples: 31536 | consumed tokens: 64585728 | elapsed time per iteration (s): 15.26 | learning rate: 1.033E-05 | global batch size: 16 | lm loss: 6.305531E+00 | grad norm: 0.858 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1972/ 128728 | consumed samples: 31552 | consumed tokens: 64618496 | elapsed time per iteration (s): 15.23 | learning rate: 1.034E-05 | global batch size: 16 | lm loss: 6.369489E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1973/ 128728 | consumed samples: 31568 | consumed tokens: 64651264 | elapsed time per iteration (s): 15.18 | learning rate: 1.034E-05 | global batch size: 16 | lm loss: 6.497954E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 1974/ 128728 | consumed samples: 31584 | consumed tokens: 64684032 | elapsed time per iteration (s): 15.21 | learning rate: 1.035E-05 | global batch size: 16 | lm loss: 6.460599E+00 | grad norm: 0.879 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1975/ 128728 | consumed samples: 31600 | consumed tokens: 64716800 | elapsed time per iteration (s): 15.20 | learning rate: 1.035E-05 | global batch size: 16 | lm loss: 6.474432E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1976/ 128728 | consumed samples: 31616 | consumed tokens: 64749568 | elapsed time per iteration (s): 15.21 | learning rate: 1.036E-05 | global batch size: 16 | lm loss: 6.461910E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1977/ 128728 | consumed samples: 31632 | consumed tokens: 64782336 | elapsed time per iteration (s): 15.22 | learning rate: 
1.037E-05 | global batch size: 16 | lm loss: 6.431888E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1978/ 128728 | consumed samples: 31648 | consumed tokens: 64815104 | elapsed time per iteration (s): 15.22 | learning rate: 1.037E-05 | global batch size: 16 | lm loss: 6.392217E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1979/ 128728 | consumed samples: 31664 | consumed tokens: 64847872 | elapsed time per iteration (s): 15.23 | learning rate: 1.038E-05 | global batch size: 16 | lm loss: 6.331327E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1980/ 128728 | consumed samples: 31680 | consumed tokens: 64880640 | elapsed time per iteration (s): 15.25 | learning rate: 1.038E-05 | global batch size: 16 | lm loss: 6.728785E+00 | grad norm: 1.497 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1981/ 128728 | consumed samples: 31696 | consumed tokens: 64913408 | elapsed time per iteration (s): 15.21 | learning rate: 1.039E-05 | global batch size: 16 | lm loss: 6.497895E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1982/ 128728 | consumed samples: 31712 | consumed tokens: 64946176 | elapsed time per iteration (s): 15.24 | learning rate: 1.039E-05 | global batch size: 16 | lm loss: 6.456326E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1983/ 128728 | consumed 
samples: 31728 | consumed tokens: 64978944 | elapsed time per iteration (s): 15.20 | learning rate: 1.040E-05 | global batch size: 16 | lm loss: 6.607990E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1984/ 128728 | consumed samples: 31744 | consumed tokens: 65011712 | elapsed time per iteration (s): 15.22 | learning rate: 1.040E-05 | global batch size: 16 | lm loss: 6.539401E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1985/ 128728 | consumed samples: 31760 | consumed tokens: 65044480 | elapsed time per iteration (s): 15.21 | learning rate: 1.041E-05 | global batch size: 16 | lm loss: 6.522558E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1986/ 128728 | consumed samples: 31776 | consumed tokens: 65077248 | elapsed time per iteration (s): 15.21 | learning rate: 1.041E-05 | global batch size: 16 | lm loss: 6.358567E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 1987/ 128728 | consumed samples: 31792 | consumed tokens: 65110016 | elapsed time per iteration (s): 15.22 | learning rate: 1.042E-05 | global batch size: 16 | lm loss: 6.626979E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1988/ 128728 | consumed samples: 31808 | consumed tokens: 65142784 | elapsed time per iteration (s): 15.23 | learning rate: 1.042E-05 | global batch size: 16 | lm loss: 6.454780E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 1989/ 128728 | consumed samples: 31824 | consumed tokens: 65175552 | elapsed time per iteration (s): 15.23 | learning rate: 1.043E-05 | global batch size: 16 | lm loss: 6.659132E+00 | grad norm: 1.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1990/ 128728 | consumed samples: 31840 | consumed tokens: 65208320 | elapsed time per iteration (s): 15.23 | learning rate: 1.043E-05 | global batch size: 16 | lm loss: 6.639725E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1991/ 128728 | consumed samples: 31856 | consumed tokens: 65241088 | elapsed time per iteration (s): 15.21 | learning rate: 1.044E-05 | global batch size: 16 | lm loss: 6.386582E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 1992/ 128728 | consumed samples: 31872 | consumed tokens: 65273856 | elapsed time per iteration (s): 15.25 | learning rate: 1.044E-05 | global batch size: 16 | lm loss: 6.536162E+00 | grad norm: 1.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1993/ 128728 | consumed samples: 31888 | consumed tokens: 65306624 | elapsed time per iteration (s): 15.20 | learning rate: 1.045E-05 | global batch size: 16 | lm loss: 6.536993E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 1994/ 128728 | consumed samples: 31904 | consumed tokens: 65339392 | elapsed time per iteration (s): 15.23 | learning rate: 1.045E-05 | global batch size: 16 | lm loss: 
6.499975E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1995/ 128728 | consumed samples: 31920 | consumed tokens: 65372160 | elapsed time per iteration (s): 15.23 | learning rate: 1.046E-05 | global batch size: 16 | lm loss: 6.507980E+00 | grad norm: 1.094 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1996/ 128728 | consumed samples: 31936 | consumed tokens: 65404928 | elapsed time per iteration (s): 15.24 | learning rate: 1.046E-05 | global batch size: 16 | lm loss: 6.813938E+00 | grad norm: 1.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 1997/ 128728 | consumed samples: 31952 | consumed tokens: 65437696 | elapsed time per iteration (s): 15.23 | learning rate: 1.047E-05 | global batch size: 16 | lm loss: 6.574802E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 1998/ 128728 | consumed samples: 31968 | consumed tokens: 65470464 | elapsed time per iteration (s): 15.25 | learning rate: 1.048E-05 | global batch size: 16 | lm loss: 6.272683E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 1999/ 128728 | consumed samples: 31984 | consumed tokens: 65503232 | elapsed time per iteration (s): 15.25 | learning rate: 1.048E-05 | global batch size: 16 | lm loss: 6.425203E+00 | grad norm: 1.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default0]:[2022-03-03 14:24:27,588] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, 
skipped=0, lr=[1.0485786843614321e-05, 1.0485786843614321e-05, 1.0485786843614321e-05], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
[default0]:steps: 2000 loss: 6.4689 iter time (s): 13.947 samples/sec: 1.147
[default7]: iteration 2000/ 128728 | consumed samples: 32000 | consumed tokens: 65536000 | elapsed time per iteration (s): 15.22 | learning rate: 1.049E-05 | global batch size: 16 | lm loss: 6.468902E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default0]:saving checkpoint at iteration 2000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default7]:------------------------------------------------------------------------------------------
[default7]:valid loss at iteration 2000 | lm loss value: 6.755413E+00 | lm loss PPL: 8.586940E+02 |
[default7]:------------------------------------------------------------------------------------------
[default1]:[2022-03-03 14:24:56,077] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/mp_rank_01_model_states.pt
[default0]:[2022-03-03 14:24:56,084] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/mp_rank_00_model_states.pt
[default1]:[2022-03-03 14:25:04,064] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt
[default5]:[2022-03-03 14:25:04,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt
[default3]:[2022-03-03 14:25:04,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt
[default0]:[2022-03-03 14:25:04,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt
[default1]:[2022-03-03 14:25:04,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt
[default5]:[2022-03-03 14:25:04,427] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt
[default2]:[2022-03-03 14:25:04,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt
[default1]:[2022-03-03 14:25:04,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 14:25:04,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt
[default3]:[2022-03-03 14:25:04,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt
[default2]:[2022-03-03 14:25:04,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt
[default5]:[2022-03-03 14:25:04,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt
[default3]:[2022-03-03 14:25:04,787] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt
[default4]:[2022-03-03 14:25:04,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt
[default4]:[2022-03-03 14:25:04,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt
[default6]:[2022-03-03 14:25:04,783] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt
[default3]:[2022-03-03 14:25:04,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt
[default0]:[2022-03-03 14:25:04,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt
[default4]:[2022-03-03 14:25:04,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default7]:[2022-03-03 14:25:04,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default6]:[2022-03-03 14:25:04,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default6]:[2022-03-03 14:25:04,926] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default4]:[2022-03-03 14:25:04,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default2]:[2022-03-03 14:25:05,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default7]:[2022-03-03 14:25:05,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default5]:[2022-03-03 14:25:05,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default7]:[2022-03-03 14:25:05,061] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default1]:[2022-03-03 14:25:05,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default7]:[2022-03-03 14:25:05,131] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default5]:[2022-03-03 14:25:05,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default0]:[2022-03-03 14:25:05,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default1]:[2022-03-03 14:25:05,282] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default0]:[2022-03-03 14:25:05,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default6]:[2022-03-03 14:25:05,332] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default4]:[2022-03-03 14:25:05,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default7]:[2022-03-03 14:25:05,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default6]:[2022-03-03 14:25:05,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default2]:[2022-03-03 14:25:05,712] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default3]:[2022-03-03 14:25:05,705] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default0]:[2022-03-03 14:25:05,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default0]:[2022-03-03 14:25:06,466] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default0]:[2022-03-03 14:25:06,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default3]:[2022-03-03 14:25:06,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default6]:[2022-03-03 14:25:06,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default2]:[2022-03-03 14:25:06,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default7]:[2022-03-03 14:25:06,624] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default1]:[2022-03-03 14:25:06,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default5]:[2022-03-03 14:25:06,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default2]:[2022-03-03 14:25:06,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default4]:[2022-03-03 14:25:06,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default1]:[2022-03-03 14:25:07,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default6]:[2022-03-03 14:25:07,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default4]:[2022-03-03 14:25:07,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default3]:[2022-03-03 14:25:07,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default6]:[2022-03-03 14:25:07,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default7]:[2022-03-03 14:25:07,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default4]:[2022-03-03 14:25:07,516] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default0]:[2022-03-03 14:25:07,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default4]:[2022-03-03 14:25:07,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default5]:[2022-03-03 14:25:07,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default4]:[2022-03-03 14:25:07,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default1]:[2022-03-03 14:25:07,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default5]:[2022-03-03 14:25:07,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default3]:[2022-03-03 14:25:07,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default6]:[2022-03-03 14:25:07,750] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default0]:[2022-03-03 14:25:07,845] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default7]:[2022-03-03 14:25:07,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default3]:[2022-03-03 14:25:07,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default3]:[2022-03-03 14:25:08,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default1]:[2022-03-03 14:25:08,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default7]:[2022-03-03 14:25:08,111] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default7]:[2022-03-03 14:25:08,136] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default4]:[2022-03-03 14:25:08,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default5]:[2022-03-03 14:25:08,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default4]:[2022-03-03 14:25:08,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default5]:[2022-03-03 14:25:08,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default7]:[2022-03-03 14:25:08,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default2]:[2022-03-03 14:25:08,634] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default1]:[2022-03-03 14:25:08,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default1]:[2022-03-03 14:25:08,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default2]:[2022-03-03 14:25:08,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default5]:[2022-03-03 14:25:08,881] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default6]:[2022-03-03 14:25:08,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default0]:[2022-03-03 14:25:08,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default3]:[2022-03-03 14:25:08,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default1]:[2022-03-03 14:25:09,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default4]:[2022-03-03 14:25:09,220] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default4]:[2022-03-03 14:25:09,279] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default0]:[2022-03-03 14:25:09,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default5]:[2022-03-03 14:25:09,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default3]:[2022-03-03 14:25:09,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default0]:[2022-03-03 14:25:09,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default6]:[2022-03-03 14:25:09,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default0]:[2022-03-03 14:25:09,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default2]:[2022-03-03 14:25:09,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default5]:[2022-03-03 14:25:09,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default1]:[2022-03-03 14:25:09,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default0]:[2022-03-03 14:25:09,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default2]:[2022-03-03 14:25:09,665] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default7]:[2022-03-03 14:25:09,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default3]:[2022-03-03 14:25:09,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default4]:[2022-03-03 14:25:09,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default2]:[2022-03-03 14:25:09,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default6]:[2022-03-03 14:25:09,961] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default1]:[2022-03-03 14:25:09,929] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default0]:[2022-03-03 14:25:10,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default6]:[2022-03-03 14:25:09,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default2]:[2022-03-03 14:25:09,998] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default0]:[2022-03-03 14:25:10,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default2]:[2022-03-03 14:25:10,078] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default5]:[2022-03-03 14:25:10,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default3]:[2022-03-03 14:25:10,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default3]:[2022-03-03 14:25:10,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default6]:[2022-03-03 14:25:10,185] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default5]:[2022-03-03 14:25:10,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default0]:[2022-03-03 14:25:10,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default2]:[2022-03-03 14:25:10,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default0]:[2022-03-03 14:25:10,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default0]:[2022-03-03 14:25:10,407] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default4]:[2022-03-03 14:25:10,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default7]:[2022-03-03 14:25:10,483] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default7]:[2022-03-03 14:25:10,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default1]:[2022-03-03 14:25:10,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default7]:[2022-03-03 14:25:10,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default7]:[2022-03-03 14:25:10,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default0]:[2022-03-03 14:25:10,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default1]:[2022-03-03 14:25:10,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default6]:[2022-03-03 14:25:10,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default1]:[2022-03-03 14:25:10,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default6]:[2022-03-03 14:25:10,913] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default0]:[2022-03-03 14:25:10,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default1]:[2022-03-03 14:25:11,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default1]:[2022-03-03 14:25:11,030] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default2]:[2022-03-03 14:25:11,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default7]:[2022-03-03 14:25:11,146] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default4]:[2022-03-03 14:25:11,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default1]:[2022-03-03 14:25:11,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default7]:[2022-03-03 14:25:11,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default4]:[2022-03-03 14:25:11,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default3]:[2022-03-03 14:25:11,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default4]:[2022-03-03 14:25:11,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default7]:[2022-03-03 14:25:11,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default5]:[2022-03-03 14:25:11,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default5]:[2022-03-03 14:25:11,466] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default5]:[2022-03-03 14:25:11,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default5]:[2022-03-03 14:25:11,513] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default1]:[2022-03-03 14:25:11,458] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default2]:[2022-03-03 14:25:11,451] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default5]:[2022-03-03 14:25:11,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default0]:[2022-03-03 14:25:11,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default3]:[2022-03-03 14:25:11,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default0]:[2022-03-03 14:25:11,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default2]:[2022-03-03 14:25:11,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default4]:[2022-03-03 14:25:11,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default1]:[2022-03-03 14:25:11,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default1]:[2022-03-03 14:25:11,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default6]:[2022-03-03 14:25:11,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default1]:[2022-03-03 14:25:11,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default7]:[2022-03-03 14:25:11,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default2]:[2022-03-03 14:25:11,893] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default2]:[2022-03-03 14:25:11,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default6]:[2022-03-03 14:25:12,003] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default3]:[2022-03-03 14:25:11,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default3]:[2022-03-03 14:25:12,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default2]:[2022-03-03 14:25:12,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default4]:[2022-03-03 14:25:12,043] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default6]:[2022-03-03 14:25:12,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default0]:[2022-03-03 14:25:12,083] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default0]:[2022-03-03 14:25:12,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default5]:[2022-03-03 14:25:12,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default2]:[2022-03-03 14:25:12,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default7]:[2022-03-03 14:25:12,180] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default2]:[2022-03-03 14:25:12,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default6]:[2022-03-03 14:25:12,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default3]:[2022-03-03 14:25:12,321] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default4]:[2022-03-03 14:25:12,355] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default0]:[2022-03-03 14:25:12,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default3]:[2022-03-03 14:25:12,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default7]:[2022-03-03 14:25:12,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default2]:[2022-03-03 14:25:12,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default0]:[2022-03-03 14:25:12,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default5]:[2022-03-03 14:25:12,559] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default3]:[2022-03-03 14:25:12,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default2]:[2022-03-03 14:25:12,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default4]:[2022-03-03 14:25:12,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default1]:[2022-03-03 14:25:12,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default1]:[2022-03-03 14:25:12,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default1]:[2022-03-03 14:25:12,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default4]:[2022-03-03 14:25:12,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default0]:[2022-03-03 14:25:12,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default2]:[2022-03-03 14:25:12,824] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default2]:[2022-03-03 14:25:12,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default3]:[2022-03-03 14:25:12,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default3]:[2022-03-03 14:25:12,778] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default2]:[2022-03-03 14:25:12,845] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default5]:[2022-03-03 14:25:12,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default5]:[2022-03-03 14:25:12,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default2]:[2022-03-03 14:25:12,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default3]:[2022-03-03 14:25:12,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default5]:[2022-03-03 14:25:12,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default5]:[2022-03-03 14:25:12,998] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default0]:[2022-03-03 14:25:12,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default3]:[2022-03-03 14:25:12,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default6]:[2022-03-03 14:25:13,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default7]:[2022-03-03 14:25:13,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default0]:[2022-03-03 14:25:13,101] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default6]:[2022-03-03 14:25:13,131] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default6]:[2022-03-03 14:25:13,217] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default1]:[2022-03-03 14:25:13,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default6]:[2022-03-03 14:25:13,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default0]:[2022-03-03 14:25:13,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default4]:[2022-03-03 14:25:13,231] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default7]:[2022-03-03 14:25:13,290] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default1]:[2022-03-03 14:25:13,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default1]:[2022-03-03 14:25:13,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default6]:[2022-03-03 14:25:13,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default3]:[2022-03-03 14:25:13,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default7]:[2022-03-03 14:25:13,654] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default5]:[2022-03-03 14:25:13,639] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default3]:[2022-03-03 14:25:13,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default1]:[2022-03-03 14:25:13,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default5]:[2022-03-03 14:25:13,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default7]:[2022-03-03 14:25:13,783] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default6]:[2022-03-03 14:25:13,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default4]:[2022-03-03 14:25:13,884] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default7]:[2022-03-03 14:25:13,870] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default4]:[2022-03-03 14:25:13,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default2]:[2022-03-03 14:25:13,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default5]:[2022-03-03 14:25:13,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default7]:[2022-03-03 14:25:13,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default3]:[2022-03-03 14:25:13,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default2]:[2022-03-03 14:25:13,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default2]:[2022-03-03 14:25:13,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default3]:[2022-03-03 14:25:14,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default4]:[2022-03-03 14:25:13,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default0]:[2022-03-03 14:25:14,020] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default6]:[2022-03-03 14:25:14,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default2]:[2022-03-03 14:25:14,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default1]:[2022-03-03 14:25:14,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default3]:[2022-03-03 14:25:14,075] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default7]:[2022-03-03 14:25:14,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default3]:[2022-03-03 14:25:14,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default5]:[2022-03-03 14:25:14,130] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default0]:[2022-03-03 14:25:14,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default2]:[2022-03-03 14:25:14,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default4]:[2022-03-03 14:25:14,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default5]:[2022-03-03 14:25:14,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default3]:[2022-03-03 14:25:14,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default1]:[2022-03-03 14:25:14,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default0]:[2022-03-03 14:25:14,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default0]:[2022-03-03 14:25:14,331] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default6]:[2022-03-03 14:25:14,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default7]:[2022-03-03 14:25:14,475] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default1]:[2022-03-03 14:25:14,460] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default1]:[2022-03-03 14:25:14,523] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default5]:[2022-03-03 14:25:14,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default1]:[2022-03-03 14:25:14,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default0]:[2022-03-03 14:25:14,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default4]:[2022-03-03 14:25:14,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default0]:[2022-03-03 14:25:14,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default5]:[2022-03-03 14:25:14,760] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default7]:[2022-03-03 14:25:14,683] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default3]:[2022-03-03 14:25:14,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default4]:[2022-03-03 14:25:14,755] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default1]:[2022-03-03 14:25:14,776] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default4]:[2022-03-03 14:25:14,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default2]:[2022-03-03 14:25:14,804] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default2]:[2022-03-03 14:25:14,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default3]:[2022-03-03 14:25:14,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default5]:[2022-03-03 14:25:14,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default4]:[2022-03-03 14:25:14,888] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default2]:[2022-03-03 14:25:14,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default1]:[2022-03-03 14:25:14,879] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default7]:[2022-03-03 14:25:14,919] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default6]:[2022-03-03 14:25:14,929] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default3]:[2022-03-03 14:25:14,951] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default7]:[2022-03-03 14:25:15,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default3]:[2022-03-03 14:25:15,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default0]:[2022-03-03 14:25:15,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default7]:[2022-03-03 14:25:15,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default5]:[2022-03-03 14:25:15,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default0]:[2022-03-03 14:25:15,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default6]:[2022-03-03 14:25:15,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default6]:[2022-03-03 14:25:15,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default6]:[2022-03-03 14:25:15,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default7]:[2022-03-03 14:25:15,195] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default2]:[2022-03-03 14:25:15,289] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default3]:[2022-03-03 14:25:15,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default6]:[2022-03-03 14:25:15,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default3]:[2022-03-03 14:25:15,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default0]:[2022-03-03 14:25:15,405] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default1]:[2022-03-03 14:25:15,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default3]:[2022-03-03 14:25:15,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default6]:[2022-03-03 14:25:15,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default2]:[2022-03-03 14:25:15,553] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default7]:[2022-03-03 14:25:15,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default7]:[2022-03-03 14:25:15,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default4]:[2022-03-03 14:25:15,547] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default6]:[2022-03-03 14:25:15,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default3]:[2022-03-03 14:25:15,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default2]:[2022-03-03 14:25:15,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default5]:[2022-03-03 14:25:15,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default5]:[2022-03-03 14:25:15,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default4]:[2022-03-03 14:25:15,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default1]:[2022-03-03 14:25:15,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default4]:[2022-03-03 14:25:15,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default4]:[2022-03-03 14:25:15,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default1]:[2022-03-03 14:25:15,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default7]:[2022-03-03 14:25:15,890] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default0]:[2022-03-03 14:25:15,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default4]:[2022-03-03 14:25:15,875] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default3]:[2022-03-03 14:25:15,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default1]:[2022-03-03 14:25:15,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default3]:[2022-03-03 14:25:16,027] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default3]:[2022-03-03 14:25:15,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default6]:[2022-03-03 14:25:15,989] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default0]:[2022-03-03 14:25:16,092] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default3]:[2022-03-03 14:25:16,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default5]:[2022-03-03 14:25:16,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default6]:[2022-03-03 14:25:16,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default6]:[2022-03-03 14:25:16,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default6]:[2022-03-03 14:25:16,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default2]:[2022-03-03 14:25:16,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default7]:[2022-03-03 14:25:16,399] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default3]:[2022-03-03 14:25:16,420] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default1]:[2022-03-03 14:25:16,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default6]:[2022-03-03 14:25:16,509] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default0]:[2022-03-03 14:25:16,582] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default6]:[2022-03-03 14:25:16,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default2]:[2022-03-03 14:25:16,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default7]:[2022-03-03 14:25:16,632] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default5]:[2022-03-03 14:25:16,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default2]:[2022-03-03 14:25:16,720] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default2]:[2022-03-03 14:25:16,786] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default5]:[2022-03-03 14:25:16,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default4]:[2022-03-03 14:25:16,865] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default7]:[2022-03-03 14:25:16,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default1]:[2022-03-03 14:25:17,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default0]:[2022-03-03 14:25:17,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default7]:[2022-03-03 14:25:17,303] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default1]:[2022-03-03 14:25:17,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default6]:[2022-03-03 14:25:17,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default7]:[2022-03-03 14:25:17,519] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default0]:[2022-03-03 14:25:17,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default4]:[2022-03-03 14:25:17,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default1]:[2022-03-03 14:25:17,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default6]:[2022-03-03 14:25:17,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default7]:[2022-03-03 14:25:18,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default0]:[2022-03-03 14:25:18,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default0]:[2022-03-03 14:25:18,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default1]:[2022-03-03 14:25:18,120] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default1]:[2022-03-03 14:25:18,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default2]:[2022-03-03 14:25:18,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default5]:[2022-03-03 14:25:18,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default4]:[2022-03-03 14:25:18,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default6]:[2022-03-03 14:25:19,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default7]:[2022-03-03 14:25:19,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default0]:[2022-03-03 14:25:19,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default6]:[2022-03-03 14:25:19,180] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default2]:[2022-03-03 14:25:19,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default5]:[2022-03-03 14:25:19,362] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default4]:[2022-03-03 14:25:19,499] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default3]:[2022-03-03 14:25:19,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default2]:[2022-03-03 14:25:19,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default5]:[2022-03-03 14:25:20,020] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default4]:[2022-03-03 14:25:20,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default4]:[2022-03-03 14:25:20,377] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default5]:[2022-03-03 14:25:20,522] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default7]:[2022-03-03 14:25:20,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default6]:[2022-03-03 14:25:20,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default3]:[2022-03-03 14:25:20,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default2]:[2022-03-03 14:25:20,721] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default5]:[2022-03-03 14:25:21,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default7]:[2022-03-03 14:25:21,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default6]:[2022-03-03 14:25:21,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default7]:[2022-03-03 14:25:21,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default4]:[2022-03-03 14:25:22,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default6]:[2022-03-03 14:25:22,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default4]:[2022-03-03 14:25:22,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default5]:[2022-03-03 14:25:22,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default5]:[2022-03-03 14:25:22,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default4]:[2022-03-03 14:25:22,873] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default4]:[2022-03-03 14:25:26,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default0]: successfully saved checkpoint at iteration 2000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]:time (ms) | save-checkpoint: 38895.48 [default5]:[2022-03-03 14:25:26,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default7]: iteration 2001/ 128728 | consumed samples: 32016 | consumed tokens: 65568768 | elapsed time per iteration (s): 73.65 | learning rate: 1.049E-05 | global batch size: 16 | lm loss: 6.402064E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.217 | TFLOPs: 1.66 | [default7]: iteration 2002/ 128728 | consumed samples: 32032 | consumed tokens: 65601536 | elapsed time per iteration (s): 15.23 | learning rate: 
1.050E-05 | global batch size: 16 | lm loss: 6.597949E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2003/ 128728 | consumed samples: 32048 | consumed tokens: 65634304 | elapsed time per iteration (s): 15.23 | learning rate: 1.050E-05 | global batch size: 16 | lm loss: 6.418831E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2004/ 128728 | consumed samples: 32064 | consumed tokens: 65667072 | elapsed time per iteration (s): 15.19 | learning rate: 1.051E-05 | global batch size: 16 | lm loss: 6.606805E+00 | grad norm: 1.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2005/ 128728 | consumed samples: 32080 | consumed tokens: 65699840 | elapsed time per iteration (s): 15.23 | learning rate: 1.051E-05 | global batch size: 16 | lm loss: 6.370827E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2006/ 128728 | consumed samples: 32096 | consumed tokens: 65732608 | elapsed time per iteration (s): 15.23 | learning rate: 1.052E-05 | global batch size: 16 | lm loss: 6.308137E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2007/ 128728 | consumed samples: 32112 | consumed tokens: 65765376 | elapsed time per iteration (s): 15.21 | learning rate: 1.052E-05 | global batch size: 16 | lm loss: 6.523125E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2008/ 128728 | consumed 
samples: 32128 | consumed tokens: 65798144 | elapsed time per iteration (s): 15.20 | learning rate: 1.053E-05 | global batch size: 16 | lm loss: 6.829843E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2009/ 128728 | consumed samples: 32144 | consumed tokens: 65830912 | elapsed time per iteration (s): 15.24 | learning rate: 1.053E-05 | global batch size: 16 | lm loss: 6.465959E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2010/ 128728 | consumed samples: 32160 | consumed tokens: 65863680 | elapsed time per iteration (s): 15.23 | learning rate: 1.054E-05 | global batch size: 16 | lm loss: 6.585162E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2011/ 128728 | consumed samples: 32176 | consumed tokens: 65896448 | elapsed time per iteration (s): 15.16 | learning rate: 1.054E-05 | global batch size: 16 | lm loss: 6.360588E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2012/ 128728 | consumed samples: 32192 | consumed tokens: 65929216 | elapsed time per iteration (s): 15.22 | learning rate: 1.055E-05 | global batch size: 16 | lm loss: 6.488918E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2013/ 128728 | consumed samples: 32208 | consumed tokens: 65961984 | elapsed time per iteration (s): 15.19 | learning rate: 1.055E-05 | global batch size: 16 | lm loss: 6.635891E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2014/ 128728 | consumed samples: 32224 | consumed tokens: 65994752 | elapsed time per iteration (s): 15.24 | learning rate: 1.056E-05 | global batch size: 16 | lm loss: 6.583560E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2015/ 128728 | consumed samples: 32240 | consumed tokens: 66027520 | elapsed time per iteration (s): 15.21 | learning rate: 1.056E-05 | global batch size: 16 | lm loss: 6.287863E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2016/ 128728 | consumed samples: 32256 | consumed tokens: 66060288 | elapsed time per iteration (s): 15.19 | learning rate: 1.057E-05 | global batch size: 16 | lm loss: 6.449275E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2017/ 128728 | consumed samples: 32272 | consumed tokens: 66093056 | elapsed time per iteration (s): 15.22 | learning rate: 1.057E-05 | global batch size: 16 | lm loss: 6.813572E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2018/ 128728 | consumed samples: 32288 | consumed tokens: 66125824 | elapsed time per iteration (s): 15.21 | learning rate: 1.058E-05 | global batch size: 16 | lm loss: 6.464739E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2019/ 128728 | consumed samples: 32304 | consumed tokens: 66158592 | elapsed time per iteration (s): 15.20 | learning rate: 1.059E-05 | global batch size: 16 | lm loss: 
6.490543E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2020/ 128728 | consumed samples: 32320 | consumed tokens: 66191360 | elapsed time per iteration (s): 15.17 | learning rate: 1.059E-05 | global batch size: 16 | lm loss: 6.522612E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2021/ 128728 | consumed samples: 32336 | consumed tokens: 66224128 | elapsed time per iteration (s): 15.22 | learning rate: 1.060E-05 | global batch size: 16 | lm loss: 6.463878E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2022/ 128728 | consumed samples: 32352 | consumed tokens: 66256896 | elapsed time per iteration (s): 15.22 | learning rate: 1.060E-05 | global batch size: 16 | lm loss: 6.588681E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2023/ 128728 | consumed samples: 32368 | consumed tokens: 66289664 | elapsed time per iteration (s): 15.26 | learning rate: 1.061E-05 | global batch size: 16 | lm loss: 6.585972E+00 | grad norm: 1.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2024/ 128728 | consumed samples: 32384 | consumed tokens: 66322432 | elapsed time per iteration (s): 15.25 | learning rate: 1.061E-05 | global batch size: 16 | lm loss: 6.484285E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2025/ 128728 | consumed samples: 32400 | consumed tokens: 66355200 | 
elapsed time per iteration (s): 15.21 | learning rate: 1.062E-05 | global batch size: 16 | lm loss: 6.319049E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2026/ 128728 | consumed samples: 32416 | consumed tokens: 66387968 | elapsed time per iteration (s): 15.23 | learning rate: 1.062E-05 | global batch size: 16 | lm loss: 6.435322E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2027/ 128728 | consumed samples: 32432 | consumed tokens: 66420736 | elapsed time per iteration (s): 15.24 | learning rate: 1.063E-05 | global batch size: 16 | lm loss: 6.357363E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2028/ 128728 | consumed samples: 32448 | consumed tokens: 66453504 | elapsed time per iteration (s): 15.23 | learning rate: 1.063E-05 | global batch size: 16 | lm loss: 6.541761E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2029/ 128728 | consumed samples: 32464 | consumed tokens: 66486272 | elapsed time per iteration (s): 15.23 | learning rate: 1.064E-05 | global batch size: 16 | lm loss: 6.403821E+00 | grad norm: 1.879 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2030/ 128728 | consumed samples: 32480 | consumed tokens: 66519040 | elapsed time per iteration (s): 15.21 | learning rate: 1.064E-05 | global batch size: 16 | lm loss: 6.531659E+00 | grad norm: 1.442 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.05 | [default7]: iteration 2031/ 128728 | consumed samples: 32496 | consumed tokens: 66551808 | elapsed time per iteration (s): 15.24 | learning rate: 1.065E-05 | global batch size: 16 | lm loss: 6.443928E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2032/ 128728 | consumed samples: 32512 | consumed tokens: 66584576 | elapsed time per iteration (s): 15.22 | learning rate: 1.065E-05 | global batch size: 16 | lm loss: 6.522864E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2033/ 128728 | consumed samples: 32528 | consumed tokens: 66617344 | elapsed time per iteration (s): 15.22 | learning rate: 1.066E-05 | global batch size: 16 | lm loss: 6.443838E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2034/ 128728 | consumed samples: 32544 | consumed tokens: 66650112 | elapsed time per iteration (s): 15.23 | learning rate: 1.066E-05 | global batch size: 16 | lm loss: 6.476315E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2035/ 128728 | consumed samples: 32560 | consumed tokens: 66682880 | elapsed time per iteration (s): 15.24 | learning rate: 1.067E-05 | global batch size: 16 | lm loss: 6.310287E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2036/ 128728 | consumed samples: 32576 | consumed tokens: 66715648 | elapsed time per iteration (s): 15.25 | learning rate: 1.067E-05 | global batch size: 16 | lm loss: 6.401248E+00 | grad norm: 0.814 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2037/ 128728 | consumed samples: 32592 | consumed tokens: 66748416 | elapsed time per iteration (s): 15.23 | learning rate: 1.068E-05 | global batch size: 16 | lm loss: 6.626089E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2038/ 128728 | consumed samples: 32608 | consumed tokens: 66781184 | elapsed time per iteration (s): 15.24 | learning rate: 1.069E-05 | global batch size: 16 | lm loss: 6.504237E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2039/ 128728 | consumed samples: 32624 | consumed tokens: 66813952 | elapsed time per iteration (s): 15.22 | learning rate: 1.069E-05 | global batch size: 16 | lm loss: 6.728966E+00 | grad norm: 0.975 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2040/ 128728 | consumed samples: 32640 | consumed tokens: 66846720 | elapsed time per iteration (s): 15.21 | learning rate: 1.070E-05 | global batch size: 16 | lm loss: 6.668674E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2041/ 128728 | consumed samples: 32656 | consumed tokens: 66879488 | elapsed time per iteration (s): 15.22 | learning rate: 1.070E-05 | global batch size: 16 | lm loss: 6.519093E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2042/ 128728 | consumed samples: 32672 | consumed tokens: 66912256 | elapsed time per iteration (s): 15.17 | learning rate: 
1.071E-05 | global batch size: 16 | lm loss: 6.491826E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2043/ 128728 | consumed samples: 32688 | consumed tokens: 66945024 | elapsed time per iteration (s): 15.20 | learning rate: 1.071E-05 | global batch size: 16 | lm loss: 6.445816E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2044/ 128728 | consumed samples: 32704 | consumed tokens: 66977792 | elapsed time per iteration (s): 15.24 | learning rate: 1.072E-05 | global batch size: 16 | lm loss: 6.649926E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2045/ 128728 | consumed samples: 32720 | consumed tokens: 67010560 | elapsed time per iteration (s): 15.21 | learning rate: 1.072E-05 | global batch size: 16 | lm loss: 6.410240E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2046/ 128728 | consumed samples: 32736 | consumed tokens: 67043328 | elapsed time per iteration (s): 15.25 | learning rate: 1.073E-05 | global batch size: 16 | lm loss: 6.510799E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2047/ 128728 | consumed samples: 32752 | consumed tokens: 67076096 | elapsed time per iteration (s): 15.25 | learning rate: 1.073E-05 | global batch size: 16 | lm loss: 6.586518E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2048/ 128728 | consumed 
samples: 32768 | consumed tokens: 67108864 | elapsed time per iteration (s): 15.24 | learning rate: 1.074E-05 | global batch size: 16 | lm loss: 6.675879E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2049/ 128728 | consumed samples: 32784 | consumed tokens: 67141632 | elapsed time per iteration (s): 15.17 | learning rate: 1.074E-05 | global batch size: 16 | lm loss: 6.550882E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2050/ 128728 | consumed samples: 32800 | consumed tokens: 67174400 | elapsed time per iteration (s): 15.22 | learning rate: 1.075E-05 | global batch size: 16 | lm loss: 6.626620E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2051/ 128728 | consumed samples: 32816 | consumed tokens: 67207168 | elapsed time per iteration (s): 15.18 | learning rate: 1.075E-05 | global batch size: 16 | lm loss: 6.336770E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2052/ 128728 | consumed samples: 32832 | consumed tokens: 67239936 | elapsed time per iteration (s): 15.23 | learning rate: 1.076E-05 | global batch size: 16 | lm loss: 6.473155E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2053/ 128728 | consumed samples: 32848 | consumed tokens: 67272704 | elapsed time per iteration (s): 15.22 | learning rate: 1.076E-05 | global batch size: 16 | lm loss: 6.799645E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2054/ 128728 | consumed samples: 32864 | consumed tokens: 67305472 | elapsed time per iteration (s): 15.21 | learning rate: 1.077E-05 | global batch size: 16 | lm loss: 6.377295E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2055/ 128728 | consumed samples: 32880 | consumed tokens: 67338240 | elapsed time per iteration (s): 15.17 | learning rate: 1.077E-05 | global batch size: 16 | lm loss: 6.436339E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2056/ 128728 | consumed samples: 32896 | consumed tokens: 67371008 | elapsed time per iteration (s): 15.21 | learning rate: 1.078E-05 | global batch size: 16 | lm loss: 6.468864E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2057/ 128728 | consumed samples: 32912 | consumed tokens: 67403776 | elapsed time per iteration (s): 15.19 | learning rate: 1.078E-05 | global batch size: 16 | lm loss: 6.744889E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2058/ 128728 | consumed samples: 32928 | consumed tokens: 67436544 | elapsed time per iteration (s): 15.16 | learning rate: 1.079E-05 | global batch size: 16 | lm loss: 6.324127E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2059/ 128728 | consumed samples: 32944 | consumed tokens: 67469312 | elapsed time per iteration (s): 15.26 | learning rate: 1.080E-05 | global batch size: 16 | lm loss: 
6.515798E+00 | grad norm: 1.601 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2060/ 128728 | consumed samples: 32960 | consumed tokens: 67502080 | elapsed time per iteration (s): 15.29 | learning rate: 1.080E-05 | global batch size: 16 | lm loss: 6.469310E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 2061/ 128728 | consumed samples: 32976 | consumed tokens: 67534848 | elapsed time per iteration (s): 15.15 | learning rate: 1.081E-05 | global batch size: 16 | lm loss: 6.753767E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2062/ 128728 | consumed samples: 32992 | consumed tokens: 67567616 | elapsed time per iteration (s): 15.24 | learning rate: 1.081E-05 | global batch size: 16 | lm loss: 6.343292E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2063/ 128728 | consumed samples: 33008 | consumed tokens: 67600384 | elapsed time per iteration (s): 15.23 | learning rate: 1.082E-05 | global batch size: 16 | lm loss: 6.468085E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2064/ 128728 | consumed samples: 33024 | consumed tokens: 67633152 | elapsed time per iteration (s): 15.24 | learning rate: 1.082E-05 | global batch size: 16 | lm loss: 6.474153E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2065/ 128728 | consumed samples: 33040 | consumed tokens: 67665920 | 
elapsed time per iteration (s): 15.24 | learning rate: 1.083E-05 | global batch size: 16 | lm loss: 6.486255E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2066/ 128728 | consumed samples: 33056 | consumed tokens: 67698688 | elapsed time per iteration (s): 15.21 | learning rate: 1.083E-05 | global batch size: 16 | lm loss: 6.522897E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2067/ 128728 | consumed samples: 33072 | consumed tokens: 67731456 | elapsed time per iteration (s): 15.24 | learning rate: 1.084E-05 | global batch size: 16 | lm loss: 6.500715E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2068/ 128728 | consumed samples: 33088 | consumed tokens: 67764224 | elapsed time per iteration (s): 15.19 | learning rate: 1.084E-05 | global batch size: 16 | lm loss: 6.581506E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2069/ 128728 | consumed samples: 33104 | consumed tokens: 67796992 | elapsed time per iteration (s): 15.24 | learning rate: 1.085E-05 | global batch size: 16 | lm loss: 6.609797E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2070/ 128728 | consumed samples: 33120 | consumed tokens: 67829760 | elapsed time per iteration (s): 15.22 | learning rate: 1.085E-05 | global batch size: 16 | lm loss: 6.362771E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 2071/ 128728 | consumed samples: 33136 | consumed tokens: 67862528 | elapsed time per iteration (s): 15.22 | learning rate: 1.086E-05 | global batch size: 16 | lm loss: 6.606649E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2072/ 128728 | consumed samples: 33152 | consumed tokens: 67895296 | elapsed time per iteration (s): 15.18 | learning rate: 1.086E-05 | global batch size: 16 | lm loss: 6.527749E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2073/ 128728 | consumed samples: 33168 | consumed tokens: 67928064 | elapsed time per iteration (s): 15.16 | learning rate: 1.087E-05 | global batch size: 16 | lm loss: 6.351327E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2074/ 128728 | consumed samples: 33184 | consumed tokens: 67960832 | elapsed time per iteration (s): 15.20 | learning rate: 1.087E-05 | global batch size: 16 | lm loss: 6.256629E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2075/ 128728 | consumed samples: 33200 | consumed tokens: 67993600 | elapsed time per iteration (s): 15.19 | learning rate: 1.088E-05 | global batch size: 16 | lm loss: 6.681695E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2076/ 128728 | consumed samples: 33216 | consumed tokens: 68026368 | elapsed time per iteration (s): 15.20 | learning rate: 1.088E-05 | global batch size: 16 | lm loss: 6.361331E+00 | grad norm: 0.811 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2077/ 128728 | consumed samples: 33232 | consumed tokens: 68059136 | elapsed time per iteration (s): 15.18 | learning rate: 1.089E-05 | global batch size: 16 | lm loss: 6.319121E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2078/ 128728 | consumed samples: 33248 | consumed tokens: 68091904 | elapsed time per iteration (s): 15.21 | learning rate: 1.089E-05 | global batch size: 16 | lm loss: 6.285818E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2079/ 128728 | consumed samples: 33264 | consumed tokens: 68124672 | elapsed time per iteration (s): 15.22 | learning rate: 1.090E-05 | global batch size: 16 | lm loss: 6.337141E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2080/ 128728 | consumed samples: 33280 | consumed tokens: 68157440 | elapsed time per iteration (s): 15.24 | learning rate: 1.091E-05 | global batch size: 16 | lm loss: 6.420028E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2081/ 128728 | consumed samples: 33296 | consumed tokens: 68190208 | elapsed time per iteration (s): 15.15 | learning rate: 1.091E-05 | global batch size: 16 | lm loss: 6.409562E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2082/ 128728 | consumed samples: 33312 | consumed tokens: 68222976 | elapsed time per iteration (s): 15.22 | learning rate: 
1.092E-05 | global batch size: 16 | lm loss: 6.588019E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2083/ 128728 | consumed samples: 33328 | consumed tokens: 68255744 | elapsed time per iteration (s): 15.23 | learning rate: 1.092E-05 | global batch size: 16 | lm loss: 6.473838E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2084/ 128728 | consumed samples: 33344 | consumed tokens: 68288512 | elapsed time per iteration (s): 15.20 | learning rate: 1.093E-05 | global batch size: 16 | lm loss: 6.194841E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2085/ 128728 | consumed samples: 33360 | consumed tokens: 68321280 | elapsed time per iteration (s): 15.22 | learning rate: 1.093E-05 | global batch size: 16 | lm loss: 6.565664E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2086/ 128728 | consumed samples: 33376 | consumed tokens: 68354048 | elapsed time per iteration (s): 15.18 | learning rate: 1.094E-05 | global batch size: 16 | lm loss: 6.302047E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2087/ 128728 | consumed samples: 33392 | consumed tokens: 68386816 | elapsed time per iteration (s): 15.20 | learning rate: 1.094E-05 | global batch size: 16 | lm loss: 6.493527E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2088/ 128728 | consumed 
samples: 33408 | consumed tokens: 68419584 | elapsed time per iteration (s): 15.15 | learning rate: 1.095E-05 | global batch size: 16 | lm loss: 6.456075E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2089/ 128728 | consumed samples: 33424 | consumed tokens: 68452352 | elapsed time per iteration (s): 15.23 | learning rate: 1.095E-05 | global batch size: 16 | lm loss: 6.435150E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2090/ 128728 | consumed samples: 33440 | consumed tokens: 68485120 | elapsed time per iteration (s): 15.19 | learning rate: 1.096E-05 | global batch size: 16 | lm loss: 6.466596E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2091/ 128728 | consumed samples: 33456 | consumed tokens: 68517888 | elapsed time per iteration (s): 15.21 | learning rate: 1.096E-05 | global batch size: 16 | lm loss: 6.540755E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2092/ 128728 | consumed samples: 33472 | consumed tokens: 68550656 | elapsed time per iteration (s): 15.22 | learning rate: 1.097E-05 | global batch size: 16 | lm loss: 6.240240E+00 | grad norm: 1.037 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2093/ 128728 | consumed samples: 33488 | consumed tokens: 68583424 | elapsed time per iteration (s): 15.23 | learning rate: 1.097E-05 | global batch size: 16 | lm loss: 6.645574E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2094/ 128728 | consumed samples: 33504 | consumed tokens: 68616192 | elapsed time per iteration (s): 15.23 | learning rate: 1.098E-05 | global batch size: 16 | lm loss: 6.572923E+00 | grad norm: 2.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2095/ 128728 | consumed samples: 33520 | consumed tokens: 68648960 | elapsed time per iteration (s): 15.24 | learning rate: 1.098E-05 | global batch size: 16 | lm loss: 6.311743E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2096/ 128728 | consumed samples: 33536 | consumed tokens: 68681728 | elapsed time per iteration (s): 15.26 | learning rate: 1.099E-05 | global batch size: 16 | lm loss: 6.382287E+00 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2097/ 128728 | consumed samples: 33552 | consumed tokens: 68714496 | elapsed time per iteration (s): 15.22 | learning rate: 1.099E-05 | global batch size: 16 | lm loss: 6.491906E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2098/ 128728 | consumed samples: 33568 | consumed tokens: 68747264 | elapsed time per iteration (s): 15.25 | learning rate: 1.100E-05 | global batch size: 16 | lm loss: 6.311732E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2099/ 128728 | consumed samples: 33584 | consumed tokens: 68780032 | elapsed time per iteration (s): 15.23 | learning rate: 1.100E-05 | global batch size: 16 | lm loss: 
6.513503E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2100/ 128728 | consumed samples: 33600 | consumed tokens: 68812800 | elapsed time per iteration (s): 15.22 | learning rate: 1.101E-05 | global batch size: 16 | lm loss: 6.404696E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2101/ 128728 | consumed samples: 33616 | consumed tokens: 68845568 | elapsed time per iteration (s): 15.23 | learning rate: 1.102E-05 | global batch size: 16 | lm loss: 6.420318E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2102/ 128728 | consumed samples: 33632 | consumed tokens: 68878336 | elapsed time per iteration (s): 15.22 | learning rate: 1.102E-05 | global batch size: 16 | lm loss: 6.275981E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2103/ 128728 | consumed samples: 33648 | consumed tokens: 68911104 | elapsed time per iteration (s): 15.22 | learning rate: 1.103E-05 | global batch size: 16 | lm loss: 6.372358E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2104/ 128728 | consumed samples: 33664 | consumed tokens: 68943872 | elapsed time per iteration (s): 15.21 | learning rate: 1.103E-05 | global batch size: 16 | lm loss: 6.088941E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2105/ 128728 | consumed samples: 33680 | consumed tokens: 68976640 | 
elapsed time per iteration (s): 15.20 | learning rate: 1.104E-05 | global batch size: 16 | lm loss: 6.542912E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2106/ 128728 | consumed samples: 33696 | consumed tokens: 69009408 | elapsed time per iteration (s): 15.20 | learning rate: 1.104E-05 | global batch size: 16 | lm loss: 6.359058E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2107/ 128728 | consumed samples: 33712 | consumed tokens: 69042176 | elapsed time per iteration (s): 15.24 | learning rate: 1.105E-05 | global batch size: 16 | lm loss: 6.501265E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2108/ 128728 | consumed samples: 33728 | consumed tokens: 69074944 | elapsed time per iteration (s): 15.21 | learning rate: 1.105E-05 | global batch size: 16 | lm loss: 6.367177E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2109/ 128728 | consumed samples: 33744 | consumed tokens: 69107712 | elapsed time per iteration (s): 15.22 | learning rate: 1.106E-05 | global batch size: 16 | lm loss: 6.246887E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2110/ 128728 | consumed samples: 33760 | consumed tokens: 69140480 | elapsed time per iteration (s): 15.23 | learning rate: 1.106E-05 | global batch size: 16 | lm loss: 6.294720E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.04 | [default7]: iteration 2111/ 128728 | consumed samples: 33776 | consumed tokens: 69173248 | elapsed time per iteration (s): 15.24 | learning rate: 1.107E-05 | global batch size: 16 | lm loss: 6.356379E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2112/ 128728 | consumed samples: 33792 | consumed tokens: 69206016 | elapsed time per iteration (s): 15.21 | learning rate: 1.107E-05 | global batch size: 16 | lm loss: 6.442330E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2113/ 128728 | consumed samples: 33808 | consumed tokens: 69238784 | elapsed time per iteration (s): 15.22 | learning rate: 1.108E-05 | global batch size: 16 | lm loss: 6.351761E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2114/ 128728 | consumed samples: 33824 | consumed tokens: 69271552 | elapsed time per iteration (s): 15.25 | learning rate: 1.108E-05 | global batch size: 16 | lm loss: 6.381479E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2115/ 128728 | consumed samples: 33840 | consumed tokens: 69304320 | elapsed time per iteration (s): 15.25 | learning rate: 1.109E-05 | global batch size: 16 | lm loss: 6.759895E+00 | grad norm: 1.024 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2116/ 128728 | consumed samples: 33856 | consumed tokens: 69337088 | elapsed time per iteration (s): 15.21 | learning rate: 1.109E-05 | global batch size: 16 | lm loss: 6.386426E+00 | grad norm: 0.790 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2117/ 128728 | consumed samples: 33872 | consumed tokens: 69369856 | elapsed time per iteration (s): 15.18 | learning rate: 1.110E-05 | global batch size: 16 | lm loss: 6.215895E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2118/ 128728 | consumed samples: 33888 | consumed tokens: 69402624 | elapsed time per iteration (s): 15.21 | learning rate: 1.110E-05 | global batch size: 16 | lm loss: 6.337823E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2119/ 128728 | consumed samples: 33904 | consumed tokens: 69435392 | elapsed time per iteration (s): 15.19 | learning rate: 1.111E-05 | global batch size: 16 | lm loss: 6.306813E+00 | grad norm: 1.303 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2120/ 128728 | consumed samples: 33920 | consumed tokens: 69468160 | elapsed time per iteration (s): 15.21 | learning rate: 1.111E-05 | global batch size: 16 | lm loss: 6.559892E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2121/ 128728 | consumed samples: 33936 | consumed tokens: 69500928 | elapsed time per iteration (s): 15.16 | learning rate: 1.112E-05 | global batch size: 16 | lm loss: 6.418102E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2122/ 128728 | consumed samples: 33952 | consumed tokens: 69533696 | elapsed time per iteration (s): 15.17 | learning rate: 
1.113E-05 | global batch size: 16 | lm loss: 6.318988E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2123/ 128728 | consumed samples: 33968 | consumed tokens: 69566464 | elapsed time per iteration (s): 15.20 | learning rate: 1.113E-05 | global batch size: 16 | lm loss: 6.536681E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2124/ 128728 | consumed samples: 33984 | consumed tokens: 69599232 | elapsed time per iteration (s): 15.20 | learning rate: 1.114E-05 | global batch size: 16 | lm loss: 6.491345E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2125/ 128728 | consumed samples: 34000 | consumed tokens: 69632000 | elapsed time per iteration (s): 15.20 | learning rate: 1.114E-05 | global batch size: 16 | lm loss: 6.226834E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2126/ 128728 | consumed samples: 34016 | consumed tokens: 69664768 | elapsed time per iteration (s): 15.17 | learning rate: 1.115E-05 | global batch size: 16 | lm loss: 6.499043E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2127/ 128728 | consumed samples: 34032 | consumed tokens: 69697536 | elapsed time per iteration (s): 15.14 | learning rate: 1.115E-05 | global batch size: 16 | lm loss: 6.487580E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 2128/ 128728 | consumed 
samples: 34048 | consumed tokens: 69730304 | elapsed time per iteration (s): 15.22 | learning rate: 1.116E-05 | global batch size: 16 | lm loss: 6.466814E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2129/ 128728 | consumed samples: 34064 | consumed tokens: 69763072 | elapsed time per iteration (s): 15.21 | learning rate: 1.116E-05 | global batch size: 16 | lm loss: 6.457569E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2130/ 128728 | consumed samples: 34080 | consumed tokens: 69795840 | elapsed time per iteration (s): 15.16 | learning rate: 1.117E-05 | global batch size: 16 | lm loss: 6.301426E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2131/ 128728 | consumed samples: 34096 | consumed tokens: 69828608 | elapsed time per iteration (s): 15.15 | learning rate: 1.117E-05 | global batch size: 16 | lm loss: 6.291666E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2132/ 128728 | consumed samples: 34112 | consumed tokens: 69861376 | elapsed time per iteration (s): 15.18 | learning rate: 1.118E-05 | global batch size: 16 | lm loss: 6.416221E+00 | grad norm: 1.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2133/ 128728 | consumed samples: 34128 | consumed tokens: 69894144 | elapsed time per iteration (s): 15.14 | learning rate: 1.118E-05 | global batch size: 16 | lm loss: 6.413527E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 2134/ 128728 | consumed samples: 34144 | consumed tokens: 69926912 | elapsed time per iteration (s): 15.21 | learning rate: 1.119E-05 | global batch size: 16 | lm loss: 6.463112E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2135/ 128728 | consumed samples: 34160 | consumed tokens: 69959680 | elapsed time per iteration (s): 15.22 | learning rate: 1.119E-05 | global batch size: 16 | lm loss: 6.217474E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2136/ 128728 | consumed samples: 34176 | consumed tokens: 69992448 | elapsed time per iteration (s): 15.22 | learning rate: 1.120E-05 | global batch size: 16 | lm loss: 6.451793E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2137/ 128728 | consumed samples: 34192 | consumed tokens: 70025216 | elapsed time per iteration (s): 15.21 | learning rate: 1.120E-05 | global batch size: 16 | lm loss: 6.483500E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2138/ 128728 | consumed samples: 34208 | consumed tokens: 70057984 | elapsed time per iteration (s): 15.23 | learning rate: 1.121E-05 | global batch size: 16 | lm loss: 6.475822E+00 | grad norm: 1.001 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2139/ 128728 | consumed samples: 34224 | consumed tokens: 70090752 | elapsed time per iteration (s): 15.24 | learning rate: 1.121E-05 | global batch size: 16 | lm loss: 
6.433506E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2140/ 128728 | consumed samples: 34240 | consumed tokens: 70123520 | elapsed time per iteration (s): 15.19 | learning rate: 1.122E-05 | global batch size: 16 | lm loss: 6.512136E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2141/ 128728 | consumed samples: 34256 | consumed tokens: 70156288 | elapsed time per iteration (s): 15.22 | learning rate: 1.123E-05 | global batch size: 16 | lm loss: 6.240833E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2142/ 128728 | consumed samples: 34272 | consumed tokens: 70189056 | elapsed time per iteration (s): 15.22 | learning rate: 1.123E-05 | global batch size: 16 | lm loss: 6.371235E+00 | grad norm: 0.952 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2143/ 128728 | consumed samples: 34288 | consumed tokens: 70221824 | elapsed time per iteration (s): 15.21 | learning rate: 1.124E-05 | global batch size: 16 | lm loss: 6.270912E+00 | grad norm: 1.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2144/ 128728 | consumed samples: 34304 | consumed tokens: 70254592 | elapsed time per iteration (s): 15.22 | learning rate: 1.124E-05 | global batch size: 16 | lm loss: 6.396859E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2145/ 128728 | consumed samples: 34320 | consumed tokens: 70287360 | 
elapsed time per iteration (s): 15.22 | learning rate: 1.125E-05 | global batch size: 16 | lm loss: 6.418116E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2146/ 128728 | consumed samples: 34336 | consumed tokens: 70320128 | elapsed time per iteration (s): 15.25 | learning rate: 1.125E-05 | global batch size: 16 | lm loss: 6.358143E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2147/ 128728 | consumed samples: 34352 | consumed tokens: 70352896 | elapsed time per iteration (s): 15.23 | learning rate: 1.126E-05 | global batch size: 16 | lm loss: 6.441215E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2148/ 128728 | consumed samples: 34368 | consumed tokens: 70385664 | elapsed time per iteration (s): 15.21 | learning rate: 1.126E-05 | global batch size: 16 | lm loss: 6.519576E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2149/ 128728 | consumed samples: 34384 | consumed tokens: 70418432 | elapsed time per iteration (s): 15.25 | learning rate: 1.127E-05 | global batch size: 16 | lm loss: 6.501801E+00 | grad norm: 1.110 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2150/ 128728 | consumed samples: 34400 | consumed tokens: 70451200 | elapsed time per iteration (s): 15.22 | learning rate: 1.127E-05 | global batch size: 16 | lm loss: 6.442147E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 2151/ 128728 | consumed samples: 34416 | consumed tokens: 70483968 | elapsed time per iteration (s): 15.23 | learning rate: 1.128E-05 | global batch size: 16 | lm loss: 6.503671E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2152/ 128728 | consumed samples: 34432 | consumed tokens: 70516736 | elapsed time per iteration (s): 15.25 | learning rate: 1.128E-05 | global batch size: 16 | lm loss: 6.525469E+00 | grad norm: 1.080 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2153/ 128728 | consumed samples: 34448 | consumed tokens: 70549504 | elapsed time per iteration (s): 15.20 | learning rate: 1.129E-05 | global batch size: 16 | lm loss: 6.287985E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2154/ 128728 | consumed samples: 34464 | consumed tokens: 70582272 | elapsed time per iteration (s): 15.16 | learning rate: 1.129E-05 | global batch size: 16 | lm loss: 6.422863E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2155/ 128728 | consumed samples: 34480 | consumed tokens: 70615040 | elapsed time per iteration (s): 15.19 | learning rate: 1.130E-05 | global batch size: 16 | lm loss: 6.324255E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2156/ 128728 | consumed samples: 34496 | consumed tokens: 70647808 | elapsed time per iteration (s): 15.23 | learning rate: 1.130E-05 | global batch size: 16 | lm loss: 6.228876E+00 | grad norm: 0.787 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2157/ 128728 | consumed samples: 34512 | consumed tokens: 70680576 | elapsed time per iteration (s): 15.24 | learning rate: 1.131E-05 | global batch size: 16 | lm loss: 6.399559E+00 | grad norm: 0.998 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2158/ 128728 | consumed samples: 34528 | consumed tokens: 70713344 | elapsed time per iteration (s): 15.20 | learning rate: 1.131E-05 | global batch size: 16 | lm loss: 6.407733E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2159/ 128728 | consumed samples: 34544 | consumed tokens: 70746112 | elapsed time per iteration (s): 15.24 | learning rate: 1.132E-05 | global batch size: 16 | lm loss: 6.550023E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2160/ 128728 | consumed samples: 34560 | consumed tokens: 70778880 | elapsed time per iteration (s): 15.20 | learning rate: 1.132E-05 | global batch size: 16 | lm loss: 6.425354E+00 | grad norm: 0.911 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2161/ 128728 | consumed samples: 34576 | consumed tokens: 70811648 | elapsed time per iteration (s): 15.21 | learning rate: 1.133E-05 | global batch size: 16 | lm loss: 6.268637E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2162/ 128728 | consumed samples: 34592 | consumed tokens: 70844416 | elapsed time per iteration (s): 15.22 | learning rate: 
1.134E-05 | global batch size: 16 | lm loss: 6.435404E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2163/ 128728 | consumed samples: 34608 | consumed tokens: 70877184 | elapsed time per iteration (s): 15.23 | learning rate: 1.134E-05 | global batch size: 16 | lm loss: 6.342821E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2164/ 128728 | consumed samples: 34624 | consumed tokens: 70909952 | elapsed time per iteration (s): 15.23 | learning rate: 1.135E-05 | global batch size: 16 | lm loss: 6.411083E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2165/ 128728 | consumed samples: 34640 | consumed tokens: 70942720 | elapsed time per iteration (s): 15.21 | learning rate: 1.135E-05 | global batch size: 16 | lm loss: 6.366318E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2166/ 128728 | consumed samples: 34656 | consumed tokens: 70975488 | elapsed time per iteration (s): 15.22 | learning rate: 1.136E-05 | global batch size: 16 | lm loss: 6.391248E+00 | grad norm: 0.966 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2167/ 128728 | consumed samples: 34672 | consumed tokens: 71008256 | elapsed time per iteration (s): 15.19 | learning rate: 1.136E-05 | global batch size: 16 | lm loss: 6.311316E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2168/ 128728 | consumed 
samples: 34688 | consumed tokens: 71041024 | elapsed time per iteration (s): 15.17 | learning rate: 1.137E-05 | global batch size: 16 | lm loss: 6.601685E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2169/ 128728 | consumed samples: 34704 | consumed tokens: 71073792 | elapsed time per iteration (s): 15.21 | learning rate: 1.137E-05 | global batch size: 16 | lm loss: 6.726507E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2170/ 128728 | consumed samples: 34720 | consumed tokens: 71106560 | elapsed time per iteration (s): 15.23 | learning rate: 1.138E-05 | global batch size: 16 | lm loss: 6.186215E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2171/ 128728 | consumed samples: 34736 | consumed tokens: 71139328 | elapsed time per iteration (s): 15.17 | learning rate: 1.138E-05 | global batch size: 16 | lm loss: 6.277987E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2172/ 128728 | consumed samples: 34752 | consumed tokens: 71172096 | elapsed time per iteration (s): 15.21 | learning rate: 1.139E-05 | global batch size: 16 | lm loss: 6.518786E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2173/ 128728 | consumed samples: 34768 | consumed tokens: 71204864 | elapsed time per iteration (s): 15.15 | learning rate: 1.139E-05 | global batch size: 16 | lm loss: 6.218275E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2174/ 128728 | consumed samples: 34784 | consumed tokens: 71237632 | elapsed time per iteration (s): 15.25 | learning rate: 1.140E-05 | global batch size: 16 | lm loss: 6.684814E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2175/ 128728 | consumed samples: 34800 | consumed tokens: 71270400 | elapsed time per iteration (s): 15.22 | learning rate: 1.140E-05 | global batch size: 16 | lm loss: 6.340273E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2176/ 128728 | consumed samples: 34816 | consumed tokens: 71303168 | elapsed time per iteration (s): 15.15 | learning rate: 1.141E-05 | global batch size: 16 | lm loss: 6.500647E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2177/ 128728 | consumed samples: 34832 | consumed tokens: 71335936 | elapsed time per iteration (s): 15.22 | learning rate: 1.141E-05 | global batch size: 16 | lm loss: 6.369704E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2178/ 128728 | consumed samples: 34848 | consumed tokens: 71368704 | elapsed time per iteration (s): 15.21 | learning rate: 1.142E-05 | global batch size: 16 | lm loss: 6.439621E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2179/ 128728 | consumed samples: 34864 | consumed tokens: 71401472 | elapsed time per iteration (s): 15.21 | learning rate: 1.142E-05 | global batch size: 16 | lm loss: 
6.381093E+00 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2180/ 128728 | consumed samples: 34880 | consumed tokens: 71434240 | elapsed time per iteration (s): 15.18 | learning rate: 1.143E-05 | global batch size: 16 | lm loss: 6.661847E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2181/ 128728 | consumed samples: 34896 | consumed tokens: 71467008 | elapsed time per iteration (s): 15.25 | learning rate: 1.143E-05 | global batch size: 16 | lm loss: 6.390566E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2182/ 128728 | consumed samples: 34912 | consumed tokens: 71499776 | elapsed time per iteration (s): 15.23 | learning rate: 1.144E-05 | global batch size: 16 | lm loss: 6.537359E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2183/ 128728 | consumed samples: 34928 | consumed tokens: 71532544 | elapsed time per iteration (s): 15.25 | learning rate: 1.145E-05 | global batch size: 16 | lm loss: 6.467527E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2184/ 128728 | consumed samples: 34944 | consumed tokens: 71565312 | elapsed time per iteration (s): 15.18 | learning rate: 1.145E-05 | global batch size: 16 | lm loss: 6.535425E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2185/ 128728 | consumed samples: 34960 | consumed tokens: 71598080 | 
elapsed time per iteration (s): 15.24 | learning rate: 1.146E-05 | global batch size: 16 | lm loss: 6.346310E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2186/ 128728 | consumed samples: 34976 | consumed tokens: 71630848 | elapsed time per iteration (s): 15.23 | learning rate: 1.146E-05 | global batch size: 16 | lm loss: 6.353755E+00 | grad norm: 1.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2187/ 128728 | consumed samples: 34992 | consumed tokens: 71663616 | elapsed time per iteration (s): 15.25 | learning rate: 1.147E-05 | global batch size: 16 | lm loss: 6.488267E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2188/ 128728 | consumed samples: 35008 | consumed tokens: 71696384 | elapsed time per iteration (s): 15.19 | learning rate: 1.147E-05 | global batch size: 16 | lm loss: 6.271044E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2189/ 128728 | consumed samples: 35024 | consumed tokens: 71729152 | elapsed time per iteration (s): 15.22 | learning rate: 1.148E-05 | global batch size: 16 | lm loss: 6.419786E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2190/ 128728 | consumed samples: 35040 | consumed tokens: 71761920 | elapsed time per iteration (s): 15.21 | learning rate: 1.148E-05 | global batch size: 16 | lm loss: 6.286393E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.05 | [default7]: iteration 2191/ 128728 | consumed samples: 35056 | consumed tokens: 71794688 | elapsed time per iteration (s): 15.24 | learning rate: 1.149E-05 | global batch size: 16 | lm loss: 6.343496E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2192/ 128728 | consumed samples: 35072 | consumed tokens: 71827456 | elapsed time per iteration (s): 15.21 | learning rate: 1.149E-05 | global batch size: 16 | lm loss: 6.306832E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2193/ 128728 | consumed samples: 35088 | consumed tokens: 71860224 | elapsed time per iteration (s): 15.20 | learning rate: 1.150E-05 | global batch size: 16 | lm loss: 6.315264E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2194/ 128728 | consumed samples: 35104 | consumed tokens: 71892992 | elapsed time per iteration (s): 15.24 | learning rate: 1.150E-05 | global batch size: 16 | lm loss: 6.208344E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2195/ 128728 | consumed samples: 35120 | consumed tokens: 71925760 | elapsed time per iteration (s): 15.22 | learning rate: 1.151E-05 | global batch size: 16 | lm loss: 6.307616E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2196/ 128728 | consumed samples: 35136 | consumed tokens: 71958528 | elapsed time per iteration (s): 15.22 | learning rate: 1.151E-05 | global batch size: 16 | lm loss: 6.192717E+00 | grad norm: 0.724 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2197/ 128728 | consumed samples: 35152 | consumed tokens: 71991296 | elapsed time per iteration (s): 15.20 | learning rate: 1.152E-05 | global batch size: 16 | lm loss: 6.418719E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2198/ 128728 | consumed samples: 35168 | consumed tokens: 72024064 | elapsed time per iteration (s): 15.19 | learning rate: 1.152E-05 | global batch size: 16 | lm loss: 6.245737E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2199/ 128728 | consumed samples: 35184 | consumed tokens: 72056832 | elapsed time per iteration (s): 15.21 | learning rate: 1.153E-05 | global batch size: 16 | lm loss: 6.310443E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2200/ 128728 | consumed samples: 35200 | consumed tokens: 72089600 | elapsed time per iteration (s): 15.21 | learning rate: 1.153E-05 | global batch size: 16 | lm loss: 6.745331E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2201/ 128728 | consumed samples: 35216 | consumed tokens: 72122368 | elapsed time per iteration (s): 15.19 | learning rate: 1.154E-05 | global batch size: 16 | lm loss: 6.420246E+00 | grad norm: 1.406 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2202/ 128728 | consumed samples: 35232 | consumed tokens: 72155136 | elapsed time per iteration (s): 15.22 | learning rate: 
1.154E-05 | global batch size: 16 | lm loss: 6.487600E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2203/ 128728 | consumed samples: 35248 | consumed tokens: 72187904 | elapsed time per iteration (s): 15.20 | learning rate: 1.155E-05 | global batch size: 16 | lm loss: 6.501083E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2204/ 128728 | consumed samples: 35264 | consumed tokens: 72220672 | elapsed time per iteration (s): 15.22 | learning rate: 1.156E-05 | global batch size: 16 | lm loss: 6.380270E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2205/ 128728 | consumed samples: 35280 | consumed tokens: 72253440 | elapsed time per iteration (s): 15.21 | learning rate: 1.156E-05 | global batch size: 16 | lm loss: 6.324718E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2206/ 128728 | consumed samples: 35296 | consumed tokens: 72286208 | elapsed time per iteration (s): 15.23 | learning rate: 1.157E-05 | global batch size: 16 | lm loss: 6.390339E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2207/ 128728 | consumed samples: 35312 | consumed tokens: 72318976 | elapsed time per iteration (s): 15.21 | learning rate: 1.157E-05 | global batch size: 16 | lm loss: 6.343199E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2208/ 128728 | consumed 
samples: 35328 | consumed tokens: 72351744 | elapsed time per iteration (s): 15.20 | learning rate: 1.158E-05 | global batch size: 16 | lm loss: 6.292582E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2209/ 128728 | consumed samples: 35344 | consumed tokens: 72384512 | elapsed time per iteration (s): 15.21 | learning rate: 1.158E-05 | global batch size: 16 | lm loss: 6.351970E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2210/ 128728 | consumed samples: 35360 | consumed tokens: 72417280 | elapsed time per iteration (s): 15.26 | learning rate: 1.159E-05 | global batch size: 16 | lm loss: 6.413839E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2211/ 128728 | consumed samples: 35376 | consumed tokens: 72450048 | elapsed time per iteration (s): 15.21 | learning rate: 1.159E-05 | global batch size: 16 | lm loss: 6.559430E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2212/ 128728 | consumed samples: 35392 | consumed tokens: 72482816 | elapsed time per iteration (s): 15.22 | learning rate: 1.160E-05 | global batch size: 16 | lm loss: 6.109778E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2213/ 128728 | consumed samples: 35408 | consumed tokens: 72515584 | elapsed time per iteration (s): 15.25 | learning rate: 1.160E-05 | global batch size: 16 | lm loss: 6.061421E+00 | grad norm: 1.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2214/ 128728 | consumed samples: 35424 | consumed tokens: 72548352 | elapsed time per iteration (s): 15.22 | learning rate: 1.161E-05 | global batch size: 16 | lm loss: 6.424275E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2215/ 128728 | consumed samples: 35440 | consumed tokens: 72581120 | elapsed time per iteration (s): 15.24 | learning rate: 1.161E-05 | global batch size: 16 | lm loss: 6.570379E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2216/ 128728 | consumed samples: 35456 | consumed tokens: 72613888 | elapsed time per iteration (s): 15.19 | learning rate: 1.162E-05 | global batch size: 16 | lm loss: 6.441628E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2217/ 128728 | consumed samples: 35472 | consumed tokens: 72646656 | elapsed time per iteration (s): 15.23 | learning rate: 1.162E-05 | global batch size: 16 | lm loss: 6.402570E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2218/ 128728 | consumed samples: 35488 | consumed tokens: 72679424 | elapsed time per iteration (s): 15.21 | learning rate: 1.163E-05 | global batch size: 16 | lm loss: 6.482116E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2219/ 128728 | consumed samples: 35504 | consumed tokens: 72712192 | elapsed time per iteration (s): 15.23 | learning rate: 1.163E-05 | global batch size: 16 | lm loss: 
6.316390E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2220/ 128728 | consumed samples: 35520 | consumed tokens: 72744960 | elapsed time per iteration (s): 15.24 | learning rate: 1.164E-05 | global batch size: 16 | lm loss: 6.419680E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2221/ 128728 | consumed samples: 35536 | consumed tokens: 72777728 | elapsed time per iteration (s): 15.23 | learning rate: 1.164E-05 | global batch size: 16 | lm loss: 6.395838E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2222/ 128728 | consumed samples: 35552 | consumed tokens: 72810496 | elapsed time per iteration (s): 15.19 | learning rate: 1.165E-05 | global batch size: 16 | lm loss: 6.283474E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2223/ 128728 | consumed samples: 35568 | consumed tokens: 72843264 | elapsed time per iteration (s): 15.21 | learning rate: 1.165E-05 | global batch size: 16 | lm loss: 6.431798E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2224/ 128728 | consumed samples: 35584 | consumed tokens: 72876032 | elapsed time per iteration (s): 15.16 | learning rate: 1.166E-05 | global batch size: 16 | lm loss: 6.408734E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2225/ 128728 | consumed samples: 35600 | consumed tokens: 72908800 | 
elapsed time per iteration (s): 15.19 | learning rate: 1.167E-05 | global batch size: 16 | lm loss: 6.471613E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2226/ 128728 | consumed samples: 35616 | consumed tokens: 72941568 | elapsed time per iteration (s): 15.22 | learning rate: 1.167E-05 | global batch size: 16 | lm loss: 6.622155E+00 | grad norm: 1.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2227/ 128728 | consumed samples: 35632 | consumed tokens: 72974336 | elapsed time per iteration (s): 15.19 | learning rate: 1.168E-05 | global batch size: 16 | lm loss: 6.308994E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2228/ 128728 | consumed samples: 35648 | consumed tokens: 73007104 | elapsed time per iteration (s): 15.23 | learning rate: 1.168E-05 | global batch size: 16 | lm loss: 6.676260E+00 | grad norm: 1.552 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2229/ 128728 | consumed samples: 35664 | consumed tokens: 73039872 | elapsed time per iteration (s): 15.19 | learning rate: 1.169E-05 | global batch size: 16 | lm loss: 6.296388E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2230/ 128728 | consumed samples: 35680 | consumed tokens: 73072640 | elapsed time per iteration (s): 15.20 | learning rate: 1.169E-05 | global batch size: 16 | lm loss: 6.460938E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.06 | [default7]: iteration 2231/ 128728 | consumed samples: 35696 | consumed tokens: 73105408 | elapsed time per iteration (s): 15.21 | learning rate: 1.170E-05 | global batch size: 16 | lm loss: 5.970007E+00 | grad norm: 1.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2232/ 128728 | consumed samples: 35712 | consumed tokens: 73138176 | elapsed time per iteration (s): 15.21 | learning rate: 1.170E-05 | global batch size: 16 | lm loss: 6.361232E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2233/ 128728 | consumed samples: 35728 | consumed tokens: 73170944 | elapsed time per iteration (s): 15.19 | learning rate: 1.171E-05 | global batch size: 16 | lm loss: 6.304492E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2234/ 128728 | consumed samples: 35744 | consumed tokens: 73203712 | elapsed time per iteration (s): 15.19 | learning rate: 1.171E-05 | global batch size: 16 | lm loss: 6.355456E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2235/ 128728 | consumed samples: 35760 | consumed tokens: 73236480 | elapsed time per iteration (s): 15.21 | learning rate: 1.172E-05 | global batch size: 16 | lm loss: 6.381365E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2236/ 128728 | consumed samples: 35776 | consumed tokens: 73269248 | elapsed time per iteration (s): 15.23 | learning rate: 1.172E-05 | global batch size: 16 | lm loss: 6.291452E+00 | grad norm: 0.793 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2237/ 128728 | consumed samples: 35792 | consumed tokens: 73302016 | elapsed time per iteration (s): 15.21 | learning rate: 1.173E-05 | global batch size: 16 | lm loss: 6.277006E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2238/ 128728 | consumed samples: 35808 | consumed tokens: 73334784 | elapsed time per iteration (s): 15.22 | learning rate: 1.173E-05 | global batch size: 16 | lm loss: 6.583983E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2239/ 128728 | consumed samples: 35824 | consumed tokens: 73367552 | elapsed time per iteration (s): 15.24 | learning rate: 1.174E-05 | global batch size: 16 | lm loss: 6.101068E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2240/ 128728 | consumed samples: 35840 | consumed tokens: 73400320 | elapsed time per iteration (s): 15.23 | learning rate: 1.174E-05 | global batch size: 16 | lm loss: 6.559378E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2241/ 128728 | consumed samples: 35856 | consumed tokens: 73433088 | elapsed time per iteration (s): 15.23 | learning rate: 1.175E-05 | global batch size: 16 | lm loss: 6.321910E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2242/ 128728 | consumed samples: 35872 | consumed tokens: 73465856 | elapsed time per iteration (s): 15.23 | learning rate: 
1.175E-05 | global batch size: 16 | lm loss: 6.386539E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2243/ 128728 | consumed samples: 35888 | consumed tokens: 73498624 | elapsed time per iteration (s): 15.20 | learning rate: 1.176E-05 | global batch size: 16 | lm loss: 6.276346E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2244/ 128728 | consumed samples: 35904 | consumed tokens: 73531392 | elapsed time per iteration (s): 15.19 | learning rate: 1.177E-05 | global batch size: 16 | lm loss: 6.231655E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2245/ 128728 | consumed samples: 35920 | consumed tokens: 73564160 | elapsed time per iteration (s): 15.22 | learning rate: 1.177E-05 | global batch size: 16 | lm loss: 6.294347E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2246/ 128728 | consumed samples: 35936 | consumed tokens: 73596928 | elapsed time per iteration (s): 15.17 | learning rate: 1.178E-05 | global batch size: 16 | lm loss: 6.297725E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2247/ 128728 | consumed samples: 35952 | consumed tokens: 73629696 | elapsed time per iteration (s): 15.20 | learning rate: 1.178E-05 | global batch size: 16 | lm loss: 6.356587E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2248/ 128728 | consumed 
samples: 35968 | consumed tokens: 73662464 | elapsed time per iteration (s): 15.21 | learning rate: 1.179E-05 | global batch size: 16 | lm loss: 6.241110E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2249/ 128728 | consumed samples: 35984 | consumed tokens: 73695232 | elapsed time per iteration (s): 15.21 | learning rate: 1.179E-05 | global batch size: 16 | lm loss: 6.421347E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2250/ 128728 | consumed samples: 36000 | consumed tokens: 73728000 | elapsed time per iteration (s): 15.24 | learning rate: 1.180E-05 | global batch size: 16 | lm loss: 6.559862E+00 | grad norm: 0.926 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2251/ 128728 | consumed samples: 36016 | consumed tokens: 73760768 | elapsed time per iteration (s): 15.24 | learning rate: 1.180E-05 | global batch size: 16 | lm loss: 6.585839E+00 | grad norm: 1.510 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2252/ 128728 | consumed samples: 36032 | consumed tokens: 73793536 | elapsed time per iteration (s): 15.16 | learning rate: 1.181E-05 | global batch size: 16 | lm loss: 6.288719E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2253/ 128728 | consumed samples: 36048 | consumed tokens: 73826304 | elapsed time per iteration (s): 15.18 | learning rate: 1.181E-05 | global batch size: 16 | lm loss: 6.560059E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2254/ 128728 | consumed samples: 36064 | consumed tokens: 73859072 | elapsed time per iteration (s): 15.21 | learning rate: 1.182E-05 | global batch size: 16 | lm loss: 6.443803E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2255/ 128728 | consumed samples: 36080 | consumed tokens: 73891840 | elapsed time per iteration (s): 15.22 | learning rate: 1.182E-05 | global batch size: 16 | lm loss: 6.219304E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2256/ 128728 | consumed samples: 36096 | consumed tokens: 73924608 | elapsed time per iteration (s): 15.21 | learning rate: 1.183E-05 | global batch size: 16 | lm loss: 6.347414E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2257/ 128728 | consumed samples: 36112 | consumed tokens: 73957376 | elapsed time per iteration (s): 15.22 | learning rate: 1.183E-05 | global batch size: 16 | lm loss: 6.342593E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2258/ 128728 | consumed samples: 36128 | consumed tokens: 73990144 | elapsed time per iteration (s): 15.21 | learning rate: 1.184E-05 | global batch size: 16 | lm loss: 6.316047E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2259/ 128728 | consumed samples: 36144 | consumed tokens: 74022912 | elapsed time per iteration (s): 15.21 | learning rate: 1.184E-05 | global batch size: 16 | lm loss: 
6.370636E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2260/ 128728 | consumed samples: 36160 | consumed tokens: 74055680 | elapsed time per iteration (s): 15.19 | learning rate: 1.185E-05 | global batch size: 16 | lm loss: 6.101759E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2261/ 128728 | consumed samples: 36176 | consumed tokens: 74088448 | elapsed time per iteration (s): 15.18 | learning rate: 1.185E-05 | global batch size: 16 | lm loss: 6.264756E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2262/ 128728 | consumed samples: 36192 | consumed tokens: 74121216 | elapsed time per iteration (s): 15.23 | learning rate: 1.186E-05 | global batch size: 16 | lm loss: 6.437723E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2263/ 128728 | consumed samples: 36208 | consumed tokens: 74153984 | elapsed time per iteration (s): 15.17 | learning rate: 1.186E-05 | global batch size: 16 | lm loss: 6.398685E+00 | grad norm: 1.106 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2264/ 128728 | consumed samples: 36224 | consumed tokens: 74186752 | elapsed time per iteration (s): 15.21 | learning rate: 1.187E-05 | global batch size: 16 | lm loss: 6.381065E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2265/ 128728 | consumed samples: 36240 | consumed tokens: 74219520 | 
elapsed time per iteration (s): 15.25 | learning rate: 1.188E-05 | global batch size: 16 | lm loss: 6.362085E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2266/ 128728 | consumed samples: 36256 | consumed tokens: 74252288 | elapsed time per iteration (s): 15.23 | learning rate: 1.188E-05 | global batch size: 16 | lm loss: 6.612569E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2267/ 128728 | consumed samples: 36272 | consumed tokens: 74285056 | elapsed time per iteration (s): 15.25 | learning rate: 1.189E-05 | global batch size: 16 | lm loss: 6.538249E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2268/ 128728 | consumed samples: 36288 | consumed tokens: 74317824 | elapsed time per iteration (s): 15.19 | learning rate: 1.189E-05 | global batch size: 16 | lm loss: 6.117253E+00 | grad norm: 1.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2269/ 128728 | consumed samples: 36304 | consumed tokens: 74350592 | elapsed time per iteration (s): 15.16 | learning rate: 1.190E-05 | global batch size: 16 | lm loss: 6.429029E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2270/ 128728 | consumed samples: 36320 | consumed tokens: 74383360 | elapsed time per iteration (s): 15.19 | learning rate: 1.190E-05 | global batch size: 16 | lm loss: 6.362271E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 
8.06 | [default7]: iteration 2271/ 128728 | consumed samples: 36336 | consumed tokens: 74416128 | elapsed time per iteration (s): 15.20 | learning rate: 1.191E-05 | global batch size: 16 | lm loss: 6.516022E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2272/ 128728 | consumed samples: 36352 | consumed tokens: 74448896 | elapsed time per iteration (s): 15.17 | learning rate: 1.191E-05 | global batch size: 16 | lm loss: 6.428764E+00 | grad norm: 1.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2273/ 128728 | consumed samples: 36368 | consumed tokens: 74481664 | elapsed time per iteration (s): 15.20 | learning rate: 1.192E-05 | global batch size: 16 | lm loss: 6.327567E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2274/ 128728 | consumed samples: 36384 | consumed tokens: 74514432 | elapsed time per iteration (s): 15.16 | learning rate: 1.192E-05 | global batch size: 16 | lm loss: 6.334872E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2275/ 128728 | consumed samples: 36400 | consumed tokens: 74547200 | elapsed time per iteration (s): 15.24 | learning rate: 1.193E-05 | global batch size: 16 | lm loss: 6.308464E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2276/ 128728 | consumed samples: 36416 | consumed tokens: 74579968 | elapsed time per iteration (s): 15.21 | learning rate: 1.193E-05 | global batch size: 16 | lm loss: 6.263940E+00 | grad norm: 0.863 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2277/ 128728 | consumed samples: 36432 | consumed tokens: 74612736 | elapsed time per iteration (s): 15.25 | learning rate: 1.194E-05 | global batch size: 16 | lm loss: 6.259884E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2278/ 128728 | consumed samples: 36448 | consumed tokens: 74645504 | elapsed time per iteration (s): 15.20 | learning rate: 1.194E-05 | global batch size: 16 | lm loss: 6.369345E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2279/ 128728 | consumed samples: 36464 | consumed tokens: 74678272 | elapsed time per iteration (s): 15.21 | learning rate: 1.195E-05 | global batch size: 16 | lm loss: 6.319073E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2280/ 128728 | consumed samples: 36480 | consumed tokens: 74711040 | elapsed time per iteration (s): 15.15 | learning rate: 1.195E-05 | global batch size: 16 | lm loss: 6.353582E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2281/ 128728 | consumed samples: 36496 | consumed tokens: 74743808 | elapsed time per iteration (s): 15.24 | learning rate: 1.196E-05 | global batch size: 16 | lm loss: 6.437267E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2282/ 128728 | consumed samples: 36512 | consumed tokens: 74776576 | elapsed time per iteration (s): 15.22 | learning rate: 
1.196E-05 | global batch size: 16 | lm loss: 6.291619E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2283/ 128728 | consumed samples: 36528 | consumed tokens: 74809344 | elapsed time per iteration (s): 15.23 | learning rate: 1.197E-05 | global batch size: 16 | lm loss: 6.131433E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2284/ 128728 | consumed samples: 36544 | consumed tokens: 74842112 | elapsed time per iteration (s): 15.21 | learning rate: 1.197E-05 | global batch size: 16 | lm loss: 6.666663E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2285/ 128728 | consumed samples: 36560 | consumed tokens: 74874880 | elapsed time per iteration (s): 15.22 | learning rate: 1.198E-05 | global batch size: 16 | lm loss: 6.291386E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2286/ 128728 | consumed samples: 36576 | consumed tokens: 74907648 | elapsed time per iteration (s): 15.22 | learning rate: 1.199E-05 | global batch size: 16 | lm loss: 6.249954E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2287/ 128728 | consumed samples: 36592 | consumed tokens: 74940416 | elapsed time per iteration (s): 15.23 | learning rate: 1.199E-05 | global batch size: 16 | lm loss: 6.303566E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2288/ 128728 | consumed 
samples: 36608 | consumed tokens: 74973184 | elapsed time per iteration (s): 15.18 | learning rate: 1.200E-05 | global batch size: 16 | lm loss: 6.470012E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2289/ 128728 | consumed samples: 36624 | consumed tokens: 75005952 | elapsed time per iteration (s): 15.22 | learning rate: 1.200E-05 | global batch size: 16 | lm loss: 6.342841E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2290/ 128728 | consumed samples: 36640 | consumed tokens: 75038720 | elapsed time per iteration (s): 15.20 | learning rate: 1.201E-05 | global batch size: 16 | lm loss: 6.390831E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2291/ 128728 | consumed samples: 36656 | consumed tokens: 75071488 | elapsed time per iteration (s): 15.21 | learning rate: 1.201E-05 | global batch size: 16 | lm loss: 6.325696E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2292/ 128728 | consumed samples: 36672 | consumed tokens: 75104256 | elapsed time per iteration (s): 15.19 | learning rate: 1.202E-05 | global batch size: 16 | lm loss: 6.365410E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2293/ 128728 | consumed samples: 36688 | consumed tokens: 75137024 | elapsed time per iteration (s): 15.24 | learning rate: 1.202E-05 | global batch size: 16 | lm loss: 6.209689E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2294/ 128728 | consumed samples: 36704 | consumed tokens: 75169792 | elapsed time per iteration (s): 15.23 | learning rate: 1.203E-05 | global batch size: 16 | lm loss: 6.228543E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2295/ 128728 | consumed samples: 36720 | consumed tokens: 75202560 | elapsed time per iteration (s): 15.22 | learning rate: 1.203E-05 | global batch size: 16 | lm loss: 6.475939E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2296/ 128728 | consumed samples: 36736 | consumed tokens: 75235328 | elapsed time per iteration (s): 15.16 | learning rate: 1.204E-05 | global batch size: 16 | lm loss: 6.304015E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2297/ 128728 | consumed samples: 36752 | consumed tokens: 75268096 | elapsed time per iteration (s): 15.20 | learning rate: 1.204E-05 | global batch size: 16 | lm loss: 6.079097E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2298/ 128728 | consumed samples: 36768 | consumed tokens: 75300864 | elapsed time per iteration (s): 15.20 | learning rate: 1.205E-05 | global batch size: 16 | lm loss: 6.455893E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2299/ 128728 | consumed samples: 36784 | consumed tokens: 75333632 | elapsed time per iteration (s): 15.22 | learning rate: 1.205E-05 | global batch size: 16 | lm loss: 
6.587708E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2300/ 128728 | consumed samples: 36800 | consumed tokens: 75366400 | elapsed time per iteration (s): 15.17 | learning rate: 1.206E-05 | global batch size: 16 | lm loss: 6.351872E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2301/ 128728 | consumed samples: 36816 | consumed tokens: 75399168 | elapsed time per iteration (s): 15.22 | learning rate: 1.206E-05 | global batch size: 16 | lm loss: 6.254686E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2302/ 128728 | consumed samples: 36832 | consumed tokens: 75431936 | elapsed time per iteration (s): 15.26 | learning rate: 1.207E-05 | global batch size: 16 | lm loss: 6.451199E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2303/ 128728 | consumed samples: 36848 | consumed tokens: 75464704 | elapsed time per iteration (s): 15.22 | learning rate: 1.207E-05 | global batch size: 16 | lm loss: 6.339918E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2304/ 128728 | consumed samples: 36864 | consumed tokens: 75497472 | elapsed time per iteration (s): 15.20 | learning rate: 1.208E-05 | global batch size: 16 | lm loss: 6.407025E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2305/ 128728 | consumed samples: 36880 | consumed tokens: 75530240 | 
elapsed time per iteration (s): 15.20 | learning rate: 1.208E-05 | global batch size: 16 | lm loss: 6.282629E+00 | grad norm: 1.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2306/ 128728 | consumed samples: 36896 | consumed tokens: 75563008 | elapsed time per iteration (s): 15.18 | learning rate: 1.209E-05 | global batch size: 16 | lm loss: 6.182392E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2307/ 128728 | consumed samples: 36912 | consumed tokens: 75595776 | elapsed time per iteration (s): 15.20 | learning rate: 1.210E-05 | global batch size: 16 | lm loss: 6.384602E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2308/ 128728 | consumed samples: 36928 | consumed tokens: 75628544 | elapsed time per iteration (s): 15.27 | learning rate: 1.210E-05 | global batch size: 16 | lm loss: 6.367940E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2309/ 128728 | consumed samples: 36944 | consumed tokens: 75661312 | elapsed time per iteration (s): 15.26 | learning rate: 1.211E-05 | global batch size: 16 | lm loss: 6.468973E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2310/ 128728 | consumed samples: 36960 | consumed tokens: 75694080 | elapsed time per iteration (s): 15.17 | learning rate: 1.211E-05 | global batch size: 16 | lm loss: 6.521853E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 
8.08 | [default7]: iteration 2311/ 128728 | consumed samples: 36976 | consumed tokens: 75726848 | elapsed time per iteration (s): 15.25 | learning rate: 1.212E-05 | global batch size: 16 | lm loss: 6.229692E+00 | grad norm: 1.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2312/ 128728 | consumed samples: 36992 | consumed tokens: 75759616 | elapsed time per iteration (s): 15.24 | learning rate: 1.212E-05 | global batch size: 16 | lm loss: 6.317523E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2313/ 128728 | consumed samples: 37008 | consumed tokens: 75792384 | elapsed time per iteration (s): 15.19 | learning rate: 1.213E-05 | global batch size: 16 | lm loss: 6.391045E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2314/ 128728 | consumed samples: 37024 | consumed tokens: 75825152 | elapsed time per iteration (s): 15.24 | learning rate: 1.213E-05 | global batch size: 16 | lm loss: 6.241301E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2315/ 128728 | consumed samples: 37040 | consumed tokens: 75857920 | elapsed time per iteration (s): 15.23 | learning rate: 1.214E-05 | global batch size: 16 | lm loss: 6.358777E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2316/ 128728 | consumed samples: 37056 | consumed tokens: 75890688 | elapsed time per iteration (s): 15.21 | learning rate: 1.214E-05 | global batch size: 16 | lm loss: 5.995783E+00 | grad norm: 0.713 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2317/ 128728 | consumed samples: 37072 | consumed tokens: 75923456 | elapsed time per iteration (s): 15.22 | learning rate: 1.215E-05 | global batch size: 16 | lm loss: 6.135524E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2318/ 128728 | consumed samples: 37088 | consumed tokens: 75956224 | elapsed time per iteration (s): 15.26 | learning rate: 1.215E-05 | global batch size: 16 | lm loss: 6.258219E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2319/ 128728 | consumed samples: 37104 | consumed tokens: 75988992 | elapsed time per iteration (s): 15.21 | learning rate: 1.216E-05 | global batch size: 16 | lm loss: 6.189133E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2320/ 128728 | consumed samples: 37120 | consumed tokens: 76021760 | elapsed time per iteration (s): 15.24 | learning rate: 1.216E-05 | global batch size: 16 | lm loss: 6.054904E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2321/ 128728 | consumed samples: 37136 | consumed tokens: 76054528 | elapsed time per iteration (s): 15.23 | learning rate: 1.217E-05 | global batch size: 16 | lm loss: 6.221347E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2322/ 128728 | consumed samples: 37152 | consumed tokens: 76087296 | elapsed time per iteration (s): 15.22 | learning rate: 
1.217E-05 | global batch size: 16 | lm loss: 6.249143E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2323/ 128728 | consumed samples: 37168 | consumed tokens: 76120064 | elapsed time per iteration (s): 15.14 | learning rate: 1.218E-05 | global batch size: 16 | lm loss: 6.319250E+00 | grad norm: 0.901 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2324/ 128728 | consumed samples: 37184 | consumed tokens: 76152832 | elapsed time per iteration (s): 15.16 | learning rate: 1.218E-05 | global batch size: 16 | lm loss: 6.291220E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2325/ 128728 | consumed samples: 37200 | consumed tokens: 76185600 | elapsed time per iteration (s): 15.27 | learning rate: 1.219E-05 | global batch size: 16 | lm loss: 6.504467E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2326/ 128728 | consumed samples: 37216 | consumed tokens: 76218368 | elapsed time per iteration (s): 15.22 | learning rate: 1.219E-05 | global batch size: 16 | lm loss: 6.271231E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2327/ 128728 | consumed samples: 37232 | consumed tokens: 76251136 | elapsed time per iteration (s): 15.23 | learning rate: 1.220E-05 | global batch size: 16 | lm loss: 6.325077E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2328/ 128728 | consumed 
samples: 37248 | consumed tokens: 76283904 | elapsed time per iteration (s): 15.19 | learning rate: 1.221E-05 | global batch size: 16 | lm loss: 6.327701E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2329/ 128728 | consumed samples: 37264 | consumed tokens: 76316672 | elapsed time per iteration (s): 15.24 | learning rate: 1.221E-05 | global batch size: 16 | lm loss: 6.261258E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2330/ 128728 | consumed samples: 37280 | consumed tokens: 76349440 | elapsed time per iteration (s): 15.23 | learning rate: 1.222E-05 | global batch size: 16 | lm loss: 6.183227E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2331/ 128728 | consumed samples: 37296 | consumed tokens: 76382208 | elapsed time per iteration (s): 15.22 | learning rate: 1.222E-05 | global batch size: 16 | lm loss: 6.479833E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2332/ 128728 | consumed samples: 37312 | consumed tokens: 76414976 | elapsed time per iteration (s): 15.21 | learning rate: 1.223E-05 | global batch size: 16 | lm loss: 6.230041E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2333/ 128728 | consumed samples: 37328 | consumed tokens: 76447744 | elapsed time per iteration (s): 15.21 | learning rate: 1.223E-05 | global batch size: 16 | lm loss: 6.235174E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2334/ 128728 | consumed samples: 37344 | consumed tokens: 76480512 | elapsed time per iteration (s): 15.23 | learning rate: 1.224E-05 | global batch size: 16 | lm loss: 6.161546E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2335/ 128728 | consumed samples: 37360 | consumed tokens: 76513280 | elapsed time per iteration (s): 15.18 | learning rate: 1.224E-05 | global batch size: 16 | lm loss: 6.356600E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2336/ 128728 | consumed samples: 37376 | consumed tokens: 76546048 | elapsed time per iteration (s): 15.30 | learning rate: 1.225E-05 | global batch size: 16 | lm loss: 6.117655E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 2337/ 128728 | consumed samples: 37392 | consumed tokens: 76578816 | elapsed time per iteration (s): 15.25 | learning rate: 1.225E-05 | global batch size: 16 | lm loss: 6.201807E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2338/ 128728 | consumed samples: 37408 | consumed tokens: 76611584 | elapsed time per iteration (s): 15.25 | learning rate: 1.226E-05 | global batch size: 16 | lm loss: 6.379524E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2339/ 128728 | consumed samples: 37424 | consumed tokens: 76644352 | elapsed time per iteration (s): 15.19 | learning rate: 1.226E-05 | global batch size: 16 | lm loss: 
6.343836E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2340/ 128728 | consumed samples: 37440 | consumed tokens: 76677120 | elapsed time per iteration (s): 15.24 | learning rate: 1.227E-05 | global batch size: 16 | lm loss: 6.259268E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2341/ 128728 | consumed samples: 37456 | consumed tokens: 76709888 | elapsed time per iteration (s): 15.27 | learning rate: 1.227E-05 | global batch size: 16 | lm loss: 6.352734E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2342/ 128728 | consumed samples: 37472 | consumed tokens: 76742656 | elapsed time per iteration (s): 15.26 | learning rate: 1.228E-05 | global batch size: 16 | lm loss: 6.273996E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2343/ 128728 | consumed samples: 37488 | consumed tokens: 76775424 | elapsed time per iteration (s): 15.21 | learning rate: 1.228E-05 | global batch size: 16 | lm loss: 6.341601E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2344/ 128728 | consumed samples: 37504 | consumed tokens: 76808192 | elapsed time per iteration (s): 15.23 | learning rate: 1.229E-05 | global batch size: 16 | lm loss: 6.245977E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2345/ 128728 | consumed samples: 37520 | consumed tokens: 76840960 | 
elapsed time per iteration (s): 15.24 | learning rate: 1.229E-05 | global batch size: 16 | lm loss: 6.364078E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2346/ 128728 | consumed samples: 37536 | consumed tokens: 76873728 | elapsed time per iteration (s): 15.27 | learning rate: 1.230E-05 | global batch size: 16 | lm loss: 6.206321E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2347/ 128728 | consumed samples: 37552 | consumed tokens: 76906496 | elapsed time per iteration (s): 15.24 | learning rate: 1.231E-05 | global batch size: 16 | lm loss: 6.211255E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2348/ 128728 | consumed samples: 37568 | consumed tokens: 76939264 | elapsed time per iteration (s): 15.17 | learning rate: 1.231E-05 | global batch size: 16 | lm loss: 6.149254E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2349/ 128728 | consumed samples: 37584 | consumed tokens: 76972032 | elapsed time per iteration (s): 15.22 | learning rate: 1.232E-05 | global batch size: 16 | lm loss: 6.326467E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2350/ 128728 | consumed samples: 37600 | consumed tokens: 77004800 | elapsed time per iteration (s): 15.21 | learning rate: 1.232E-05 | global batch size: 16 | lm loss: 6.407173E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 
8.05 | [default7]: iteration 2351/ 128728 | consumed samples: 37616 | consumed tokens: 77037568 | elapsed time per iteration (s): 15.24 | learning rate: 1.233E-05 | global batch size: 16 | lm loss: 6.484642E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2352/ 128728 | consumed samples: 37632 | consumed tokens: 77070336 | elapsed time per iteration (s): 15.23 | learning rate: 1.233E-05 | global batch size: 16 | lm loss: 6.147301E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2353/ 128728 | consumed samples: 37648 | consumed tokens: 77103104 | elapsed time per iteration (s): 15.16 | learning rate: 1.234E-05 | global batch size: 16 | lm loss: 6.507727E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2354/ 128728 | consumed samples: 37664 | consumed tokens: 77135872 | elapsed time per iteration (s): 15.25 | learning rate: 1.234E-05 | global batch size: 16 | lm loss: 6.197268E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2355/ 128728 | consumed samples: 37680 | consumed tokens: 77168640 | elapsed time per iteration (s): 15.22 | learning rate: 1.235E-05 | global batch size: 16 | lm loss: 6.132536E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2356/ 128728 | consumed samples: 37696 | consumed tokens: 77201408 | elapsed time per iteration (s): 15.25 | learning rate: 1.235E-05 | global batch size: 16 | lm loss: 6.288426E+00 | grad norm: 0.690 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2357/ 128728 | consumed samples: 37712 | consumed tokens: 77234176 | elapsed time per iteration (s): 15.23 | learning rate: 1.236E-05 | global batch size: 16 | lm loss: 6.204188E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2358/ 128728 | consumed samples: 37728 | consumed tokens: 77266944 | elapsed time per iteration (s): 15.20 | learning rate: 1.236E-05 | global batch size: 16 | lm loss: 6.382045E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2359/ 128728 | consumed samples: 37744 | consumed tokens: 77299712 | elapsed time per iteration (s): 15.19 | learning rate: 1.237E-05 | global batch size: 16 | lm loss: 6.236710E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2360/ 128728 | consumed samples: 37760 | consumed tokens: 77332480 | elapsed time per iteration (s): 15.23 | learning rate: 1.237E-05 | global batch size: 16 | lm loss: 6.214093E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2361/ 128728 | consumed samples: 37776 | consumed tokens: 77365248 | elapsed time per iteration (s): 15.24 | learning rate: 1.238E-05 | global batch size: 16 | lm loss: 6.367528E+00 | grad norm: 1.105 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2362/ 128728 | consumed samples: 37792 | consumed tokens: 77398016 | elapsed time per iteration (s): 15.22 | learning rate: 
1.238E-05 | global batch size: 16 | lm loss: 6.165552E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2363/ 128728 | consumed samples: 37808 | consumed tokens: 77430784 | elapsed time per iteration (s): 15.20 | learning rate: 1.239E-05 | global batch size: 16 | lm loss: 6.096678E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2364/ 128728 | consumed samples: 37824 | consumed tokens: 77463552 | elapsed time per iteration (s): 15.21 | learning rate: 1.239E-05 | global batch size: 16 | lm loss: 6.199354E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2365/ 128728 | consumed samples: 37840 | consumed tokens: 77496320 | elapsed time per iteration (s): 15.20 | learning rate: 1.240E-05 | global batch size: 16 | lm loss: 6.324061E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2366/ 128728 | consumed samples: 37856 | consumed tokens: 77529088 | elapsed time per iteration (s): 15.25 | learning rate: 1.240E-05 | global batch size: 16 | lm loss: 6.271080E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2367/ 128728 | consumed samples: 37872 | consumed tokens: 77561856 | elapsed time per iteration (s): 15.24 | learning rate: 1.241E-05 | global batch size: 16 | lm loss: 6.355011E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2368/ 128728 | consumed 
samples: 37888 | consumed tokens: 77594624 | elapsed time per iteration (s): 15.23 | learning rate: 1.242E-05 | global batch size: 16 | lm loss: 6.325652E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2369/ 128728 | consumed samples: 37904 | consumed tokens: 77627392 | elapsed time per iteration (s): 15.25 | learning rate: 1.242E-05 | global batch size: 16 | lm loss: 6.347246E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2370/ 128728 | consumed samples: 37920 | consumed tokens: 77660160 | elapsed time per iteration (s): 15.23 | learning rate: 1.243E-05 | global batch size: 16 | lm loss: 6.249259E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2371/ 128728 | consumed samples: 37936 | consumed tokens: 77692928 | elapsed time per iteration (s): 15.22 | learning rate: 1.243E-05 | global batch size: 16 | lm loss: 6.194085E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2372/ 128728 | consumed samples: 37952 | consumed tokens: 77725696 | elapsed time per iteration (s): 15.25 | learning rate: 1.244E-05 | global batch size: 16 | lm loss: 6.347694E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2373/ 128728 | consumed samples: 37968 | consumed tokens: 77758464 | elapsed time per iteration (s): 15.24 | learning rate: 1.244E-05 | global batch size: 16 | lm loss: 6.205462E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2374/ 128728 | consumed samples: 37984 | consumed tokens: 77791232 | elapsed time per iteration (s): 15.18 | learning rate: 1.245E-05 | global batch size: 16 | lm loss: 6.502585E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2375/ 128728 | consumed samples: 38000 | consumed tokens: 77824000 | elapsed time per iteration (s): 15.14 | learning rate: 1.245E-05 | global batch size: 16 | lm loss: 6.250050E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2376/ 128728 | consumed samples: 38016 | consumed tokens: 77856768 | elapsed time per iteration (s): 15.22 | learning rate: 1.246E-05 | global batch size: 16 | lm loss: 6.314306E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2377/ 128728 | consumed samples: 38032 | consumed tokens: 77889536 | elapsed time per iteration (s): 15.19 | learning rate: 1.246E-05 | global batch size: 16 | lm loss: 6.403317E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2378/ 128728 | consumed samples: 38048 | consumed tokens: 77922304 | elapsed time per iteration (s): 15.17 | learning rate: 1.247E-05 | global batch size: 16 | lm loss: 6.022367E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2379/ 128728 | consumed samples: 38064 | consumed tokens: 77955072 | elapsed time per iteration (s): 15.25 | learning rate: 1.247E-05 | global batch size: 16 | lm loss: 
6.283350E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2380/ 128728 | consumed samples: 38080 | consumed tokens: 77987840 | elapsed time per iteration (s): 15.20 | learning rate: 1.248E-05 | global batch size: 16 | lm loss: 6.473180E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2381/ 128728 | consumed samples: 38096 | consumed tokens: 78020608 | elapsed time per iteration (s): 15.23 | learning rate: 1.248E-05 | global batch size: 16 | lm loss: 6.159327E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2382/ 128728 | consumed samples: 38112 | consumed tokens: 78053376 | elapsed time per iteration (s): 15.26 | learning rate: 1.249E-05 | global batch size: 16 | lm loss: 6.267680E+00 | grad norm: 2.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2383/ 128728 | consumed samples: 38128 | consumed tokens: 78086144 | elapsed time per iteration (s): 15.25 | learning rate: 1.249E-05 | global batch size: 16 | lm loss: 6.295687E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2384/ 128728 | consumed samples: 38144 | consumed tokens: 78118912 | elapsed time per iteration (s): 15.21 | learning rate: 1.250E-05 | global batch size: 16 | lm loss: 6.494872E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2385/ 128728 | consumed samples: 38160 | consumed tokens: 78151680 | 
elapsed time per iteration (s): 15.23 | learning rate: 1.250E-05 | global batch size: 16 | lm loss: 6.360726E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2386/ 128728 | consumed samples: 38176 | consumed tokens: 78184448 | elapsed time per iteration (s): 15.25 | learning rate: 1.251E-05 | global batch size: 16 | lm loss: 6.168993E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 2387/ 128728 | consumed samples: 38192 | consumed tokens: 78217216 | elapsed time per iteration (s): 15.24 | learning rate: 1.251E-05 | global batch size: 16 | lm loss: 6.293113E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2388/ 128728 | consumed samples: 38208 | consumed tokens: 78249984 | elapsed time per iteration (s): 15.28 | learning rate: 1.252E-05 | global batch size: 16 | lm loss: 6.249125E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2389/ 128728 | consumed samples: 38224 | consumed tokens: 78282752 | elapsed time per iteration (s): 15.25 | learning rate: 1.253E-05 | global batch size: 16 | lm loss: 6.292615E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2390/ 128728 | consumed samples: 38240 | consumed tokens: 78315520 | elapsed time per iteration (s): 15.19 | learning rate: 1.253E-05 | global batch size: 16 | lm loss: 6.254043E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 
8.07 | [default7]: iteration 2391/ 128728 | consumed samples: 38256 | consumed tokens: 78348288 | elapsed time per iteration (s): 15.26 | learning rate: 1.254E-05 | global batch size: 16 | lm loss: 6.146667E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2392/ 128728 | consumed samples: 38272 | consumed tokens: 78381056 | elapsed time per iteration (s): 15.25 | learning rate: 1.254E-05 | global batch size: 16 | lm loss: 6.178244E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2393/ 128728 | consumed samples: 38288 | consumed tokens: 78413824 | elapsed time per iteration (s): 15.22 | learning rate: 1.255E-05 | global batch size: 16 | lm loss: 6.237836E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2394/ 128728 | consumed samples: 38304 | consumed tokens: 78446592 | elapsed time per iteration (s): 15.21 | learning rate: 1.255E-05 | global batch size: 16 | lm loss: 6.309330E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2395/ 128728 | consumed samples: 38320 | consumed tokens: 78479360 | elapsed time per iteration (s): 15.19 | learning rate: 1.256E-05 | global batch size: 16 | lm loss: 6.187210E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2396/ 128728 | consumed samples: 38336 | consumed tokens: 78512128 | elapsed time per iteration (s): 15.24 | learning rate: 1.256E-05 | global batch size: 16 | lm loss: 6.128231E+00 | grad norm: 0.670 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2397/ 128728 | consumed samples: 38352 | consumed tokens: 78544896 | elapsed time per iteration (s): 15.25 | learning rate: 1.257E-05 | global batch size: 16 | lm loss: 6.318326E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2398/ 128728 | consumed samples: 38368 | consumed tokens: 78577664 | elapsed time per iteration (s): 15.20 | learning rate: 1.257E-05 | global batch size: 16 | lm loss: 6.380693E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2399/ 128728 | consumed samples: 38384 | consumed tokens: 78610432 | elapsed time per iteration (s): 15.23 | learning rate: 1.258E-05 | global batch size: 16 | lm loss: 6.091791E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2400/ 128728 | consumed samples: 38400 | consumed tokens: 78643200 | elapsed time per iteration (s): 15.19 | learning rate: 1.258E-05 | global batch size: 16 | lm loss: 6.328229E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2401/ 128728 | consumed samples: 38416 | consumed tokens: 78675968 | elapsed time per iteration (s): 15.21 | learning rate: 1.259E-05 | global batch size: 16 | lm loss: 6.066146E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2402/ 128728 | consumed samples: 38432 | consumed tokens: 78708736 | elapsed time per iteration (s): 15.23 | learning rate: 
1.259E-05 | global batch size: 16 | lm loss: 6.242963E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2403/ 128728 | consumed samples: 38448 | consumed tokens: 78741504 | elapsed time per iteration (s): 15.22 | learning rate: 1.260E-05 | global batch size: 16 | lm loss: 6.153259E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2404/ 128728 | consumed samples: 38464 | consumed tokens: 78774272 | elapsed time per iteration (s): 15.21 | learning rate: 1.260E-05 | global batch size: 16 | lm loss: 6.031731E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2405/ 128728 | consumed samples: 38480 | consumed tokens: 78807040 | elapsed time per iteration (s): 15.28 | learning rate: 1.261E-05 | global batch size: 16 | lm loss: 6.312222E+00 | grad norm: 1.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2406/ 128728 | consumed samples: 38496 | consumed tokens: 78839808 | elapsed time per iteration (s): 15.25 | learning rate: 1.261E-05 | global batch size: 16 | lm loss: 6.096965E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2407/ 128728 | consumed samples: 38512 | consumed tokens: 78872576 | elapsed time per iteration (s): 15.22 | learning rate: 1.262E-05 | global batch size: 16 | lm loss: 6.310668E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2408/ 128728 | consumed 
samples: 38528 | consumed tokens: 78905344 | elapsed time per iteration (s): 15.22 | learning rate: 1.262E-05 | global batch size: 16 | lm loss: 6.245684E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2409/ 128728 | consumed samples: 38544 | consumed tokens: 78938112 | elapsed time per iteration (s): 15.25 | learning rate: 1.263E-05 | global batch size: 16 | lm loss: 6.369366E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2410/ 128728 | consumed samples: 38560 | consumed tokens: 78970880 | elapsed time per iteration (s): 15.18 | learning rate: 1.264E-05 | global batch size: 16 | lm loss: 6.178571E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2411/ 128728 | consumed samples: 38576 | consumed tokens: 79003648 | elapsed time per iteration (s): 15.18 | learning rate: 1.264E-05 | global batch size: 16 | lm loss: 6.309995E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2412/ 128728 | consumed samples: 38592 | consumed tokens: 79036416 | elapsed time per iteration (s): 15.24 | learning rate: 1.265E-05 | global batch size: 16 | lm loss: 6.512557E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2413/ 128728 | consumed samples: 38608 | consumed tokens: 79069184 | elapsed time per iteration (s): 15.26 | learning rate: 1.265E-05 | global batch size: 16 | lm loss: 6.460606E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2414/ 128728 | consumed samples: 38624 | consumed tokens: 79101952 | elapsed time per iteration (s): 15.22 | learning rate: 1.266E-05 | global batch size: 16 | lm loss: 6.191257E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2415/ 128728 | consumed samples: 38640 | consumed tokens: 79134720 | elapsed time per iteration (s): 15.23 | learning rate: 1.266E-05 | global batch size: 16 | lm loss: 6.224933E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2416/ 128728 | consumed samples: 38656 | consumed tokens: 79167488 | elapsed time per iteration (s): 15.20 | learning rate: 1.267E-05 | global batch size: 16 | lm loss: 6.085470E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2417/ 128728 | consumed samples: 38672 | consumed tokens: 79200256 | elapsed time per iteration (s): 15.21 | learning rate: 1.267E-05 | global batch size: 16 | lm loss: 6.211289E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2418/ 128728 | consumed samples: 38688 | consumed tokens: 79233024 | elapsed time per iteration (s): 15.23 | learning rate: 1.268E-05 | global batch size: 16 | lm loss: 6.217436E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2419/ 128728 | consumed samples: 38704 | consumed tokens: 79265792 | elapsed time per iteration (s): 15.24 | learning rate: 1.268E-05 | global batch size: 16 | lm loss: 
6.188772E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2420/ 128728 | consumed samples: 38720 | consumed tokens: 79298560 | elapsed time per iteration (s): 15.16 | learning rate: 1.269E-05 | global batch size: 16 | lm loss: 6.116030E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2421/ 128728 | consumed samples: 38736 | consumed tokens: 79331328 | elapsed time per iteration (s): 15.18 | learning rate: 1.269E-05 | global batch size: 16 | lm loss: 6.088687E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2422/ 128728 | consumed samples: 38752 | consumed tokens: 79364096 | elapsed time per iteration (s): 15.20 | learning rate: 1.270E-05 | global batch size: 16 | lm loss: 6.145873E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2423/ 128728 | consumed samples: 38768 | consumed tokens: 79396864 | elapsed time per iteration (s): 15.23 | learning rate: 1.270E-05 | global batch size: 16 | lm loss: 6.244073E+00 | grad norm: 1.080 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2424/ 128728 | consumed samples: 38784 | consumed tokens: 79429632 | elapsed time per iteration (s): 15.19 | learning rate: 1.271E-05 | global batch size: 16 | lm loss: 6.340772E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2425/ 128728 | consumed samples: 38800 | consumed tokens: 79462400 | 
elapsed time per iteration (s): 15.16 | learning rate: 1.271E-05 | global batch size: 16 | lm loss: 6.090518E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2426/ 128728 | consumed samples: 38816 | consumed tokens: 79495168 | elapsed time per iteration (s): 15.23 | learning rate: 1.272E-05 | global batch size: 16 | lm loss: 6.469316E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2427/ 128728 | consumed samples: 38832 | consumed tokens: 79527936 | elapsed time per iteration (s): 15.16 | learning rate: 1.272E-05 | global batch size: 16 | lm loss: 6.311891E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2428/ 128728 | consumed samples: 38848 | consumed tokens: 79560704 | elapsed time per iteration (s): 15.16 | learning rate: 1.273E-05 | global batch size: 16 | lm loss: 6.165546E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2429/ 128728 | consumed samples: 38864 | consumed tokens: 79593472 | elapsed time per iteration (s): 15.24 | learning rate: 1.273E-05 | global batch size: 16 | lm loss: 6.298426E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2430/ 128728 | consumed samples: 38880 | consumed tokens: 79626240 | elapsed time per iteration (s): 15.22 | learning rate: 1.274E-05 | global batch size: 16 | lm loss: 6.042541E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 2431/ 128728 | consumed samples: 38896 | consumed tokens: 79659008 | elapsed time per iteration (s): 15.19 | learning rate: 1.275E-05 | global batch size: 16 | lm loss: 6.275948E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2432/ 128728 | consumed samples: 38912 | consumed tokens: 79691776 | elapsed time per iteration (s): 15.24 | learning rate: 1.275E-05 | global batch size: 16 | lm loss: 6.168737E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2433/ 128728 | consumed samples: 38928 | consumed tokens: 79724544 | elapsed time per iteration (s): 15.21 | learning rate: 1.276E-05 | global batch size: 16 | lm loss: 6.469221E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2434/ 128728 | consumed samples: 38944 | consumed tokens: 79757312 | elapsed time per iteration (s): 15.23 | learning rate: 1.276E-05 | global batch size: 16 | lm loss: 6.241963E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2435/ 128728 | consumed samples: 38960 | consumed tokens: 79790080 | elapsed time per iteration (s): 15.23 | learning rate: 1.277E-05 | global batch size: 16 | lm loss: 6.322588E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2436/ 128728 | consumed samples: 38976 | consumed tokens: 79822848 | elapsed time per iteration (s): 15.21 | learning rate: 1.277E-05 | global batch size: 16 | lm loss: 6.185337E+00 | grad norm: 0.934 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2437/ 128728 | consumed samples: 38992 | consumed tokens: 79855616 | elapsed time per iteration (s): 15.22 | learning rate: 1.278E-05 | global batch size: 16 | lm loss: 6.192573E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2438/ 128728 | consumed samples: 39008 | consumed tokens: 79888384 | elapsed time per iteration (s): 15.25 | learning rate: 1.278E-05 | global batch size: 16 | lm loss: 6.097382E+00 | grad norm: 1.021 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2439/ 128728 | consumed samples: 39024 | consumed tokens: 79921152 | elapsed time per iteration (s): 15.22 | learning rate: 1.279E-05 | global batch size: 16 | lm loss: 6.090995E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2440/ 128728 | consumed samples: 39040 | consumed tokens: 79953920 | elapsed time per iteration (s): 15.24 | learning rate: 1.279E-05 | global batch size: 16 | lm loss: 6.367899E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2441/ 128728 | consumed samples: 39056 | consumed tokens: 79986688 | elapsed time per iteration (s): 15.22 | learning rate: 1.280E-05 | global batch size: 16 | lm loss: 6.321862E+00 | grad norm: 1.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2442/ 128728 | consumed samples: 39072 | consumed tokens: 80019456 | elapsed time per iteration (s): 15.24 | learning rate: 
1.280E-05 | global batch size: 16 | lm loss: 6.289917E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2443/ 128728 | consumed samples: 39088 | consumed tokens: 80052224 | elapsed time per iteration (s): 15.22 | learning rate: 1.281E-05 | global batch size: 16 | lm loss: 6.321412E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2444/ 128728 | consumed samples: 39104 | consumed tokens: 80084992 | elapsed time per iteration (s): 15.25 | learning rate: 1.281E-05 | global batch size: 16 | lm loss: 6.268347E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2445/ 128728 | consumed samples: 39120 | consumed tokens: 80117760 | elapsed time per iteration (s): 15.26 | learning rate: 1.282E-05 | global batch size: 16 | lm loss: 6.359283E+00 | grad norm: 1.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2446/ 128728 | consumed samples: 39136 | consumed tokens: 80150528 | elapsed time per iteration (s): 15.21 | learning rate: 1.282E-05 | global batch size: 16 | lm loss: 6.297738E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2447/ 128728 | consumed samples: 39152 | consumed tokens: 80183296 | elapsed time per iteration (s): 15.21 | learning rate: 1.283E-05 | global batch size: 16 | lm loss: 6.223781E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2448/ 128728 | consumed 
samples: 39168 | consumed tokens: 80216064 | elapsed time per iteration (s): 15.22 | learning rate: 1.283E-05 | global batch size: 16 | lm loss: 6.075761E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2449/ 128728 | consumed samples: 39184 | consumed tokens: 80248832 | elapsed time per iteration (s): 15.23 | learning rate: 1.284E-05 | global batch size: 16 | lm loss: 6.331431E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2450/ 128728 | consumed samples: 39200 | consumed tokens: 80281600 | elapsed time per iteration (s): 15.21 | learning rate: 1.285E-05 | global batch size: 16 | lm loss: 6.386670E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2451/ 128728 | consumed samples: 39216 | consumed tokens: 80314368 | elapsed time per iteration (s): 15.24 | learning rate: 1.285E-05 | global batch size: 16 | lm loss: 5.956208E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2452/ 128728 | consumed samples: 39232 | consumed tokens: 80347136 | elapsed time per iteration (s): 15.19 | learning rate: 1.286E-05 | global batch size: 16 | lm loss: 6.244837E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2453/ 128728 | consumed samples: 39248 | consumed tokens: 80379904 | elapsed time per iteration (s): 15.21 | learning rate: 1.286E-05 | global batch size: 16 | lm loss: 6.260391E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2454/ 128728 | consumed samples: 39264 | consumed tokens: 80412672 | elapsed time per iteration (s): 15.19 | learning rate: 1.287E-05 | global batch size: 16 | lm loss: 6.223280E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2455/ 128728 | consumed samples: 39280 | consumed tokens: 80445440 | elapsed time per iteration (s): 15.21 | learning rate: 1.287E-05 | global batch size: 16 | lm loss: 6.260831E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2456/ 128728 | consumed samples: 39296 | consumed tokens: 80478208 | elapsed time per iteration (s): 15.19 | learning rate: 1.288E-05 | global batch size: 16 | lm loss: 6.206467E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2457/ 128728 | consumed samples: 39312 | consumed tokens: 80510976 | elapsed time per iteration (s): 15.22 | learning rate: 1.288E-05 | global batch size: 16 | lm loss: 6.305583E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2458/ 128728 | consumed samples: 39328 | consumed tokens: 80543744 | elapsed time per iteration (s): 15.25 | learning rate: 1.289E-05 | global batch size: 16 | lm loss: 6.003456E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2459/ 128728 | consumed samples: 39344 | consumed tokens: 80576512 | elapsed time per iteration (s): 15.16 | learning rate: 1.289E-05 | global batch size: 16 | lm loss: 
6.305748E+00 | grad norm: 1.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2460/ 128728 | consumed samples: 39360 | consumed tokens: 80609280 | elapsed time per iteration (s): 15.25 | learning rate: 1.290E-05 | global batch size: 16 | lm loss: 6.356349E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2461/ 128728 | consumed samples: 39376 | consumed tokens: 80642048 | elapsed time per iteration (s): 15.24 | learning rate: 1.290E-05 | global batch size: 16 | lm loss: 6.082371E+00 | grad norm: 1.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2462/ 128728 | consumed samples: 39392 | consumed tokens: 80674816 | elapsed time per iteration (s): 15.21 | learning rate: 1.291E-05 | global batch size: 16 | lm loss: 6.293061E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2463/ 128728 | consumed samples: 39408 | consumed tokens: 80707584 | elapsed time per iteration (s): 15.20 | learning rate: 1.291E-05 | global batch size: 16 | lm loss: 6.216317E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2464/ 128728 | consumed samples: 39424 | consumed tokens: 80740352 | elapsed time per iteration (s): 15.22 | learning rate: 1.292E-05 | global batch size: 16 | lm loss: 6.274666E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2465/ 128728 | consumed samples: 39440 | consumed tokens: 80773120 | 
elapsed time per iteration (s): 15.28 | learning rate: 1.292E-05 | global batch size: 16 | lm loss: 6.314239E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2466/ 128728 | consumed samples: 39456 | consumed tokens: 80805888 | elapsed time per iteration (s): 15.21 | learning rate: 1.293E-05 | global batch size: 16 | lm loss: 6.179266E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2467/ 128728 | consumed samples: 39472 | consumed tokens: 80838656 | elapsed time per iteration (s): 15.22 | learning rate: 1.293E-05 | global batch size: 16 | lm loss: 6.121453E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2468/ 128728 | consumed samples: 39488 | consumed tokens: 80871424 | elapsed time per iteration (s): 15.24 | learning rate: 1.294E-05 | global batch size: 16 | lm loss: 6.419597E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2469/ 128728 | consumed samples: 39504 | consumed tokens: 80904192 | elapsed time per iteration (s): 15.27 | learning rate: 1.294E-05 | global batch size: 16 | lm loss: 6.172673E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2470/ 128728 | consumed samples: 39520 | consumed tokens: 80936960 | elapsed time per iteration (s): 15.24 | learning rate: 1.295E-05 | global batch size: 16 | lm loss: 6.166053E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 2471/ 128728 | consumed samples: 39536 | consumed tokens: 80969728 | elapsed time per iteration (s): 15.26 | learning rate: 1.296E-05 | global batch size: 16 | lm loss: 6.552093E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2472/ 128728 | consumed samples: 39552 | consumed tokens: 81002496 | elapsed time per iteration (s): 15.26 | learning rate: 1.296E-05 | global batch size: 16 | lm loss: 6.085385E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2473/ 128728 | consumed samples: 39568 | consumed tokens: 81035264 | elapsed time per iteration (s): 15.20 | learning rate: 1.297E-05 | global batch size: 16 | lm loss: 6.246649E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2474/ 128728 | consumed samples: 39584 | consumed tokens: 81068032 | elapsed time per iteration (s): 15.25 | learning rate: 1.297E-05 | global batch size: 16 | lm loss: 6.106105E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 2475/ 128728 | consumed samples: 39600 | consumed tokens: 81100800 | elapsed time per iteration (s): 15.16 | learning rate: 1.298E-05 | global batch size: 16 | lm loss: 5.814936E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2476/ 128728 | consumed samples: 39616 | consumed tokens: 81133568 | elapsed time per iteration (s): 15.21 | learning rate: 1.298E-05 | global batch size: 16 | lm loss: 6.232026E+00 | grad norm: 0.769 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2477/ 128728 | consumed samples: 39632 | consumed tokens: 81166336 | elapsed time per iteration (s): 15.21 | learning rate: 1.299E-05 | global batch size: 16 | lm loss: 6.282386E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2478/ 128728 | consumed samples: 39648 | consumed tokens: 81199104 | elapsed time per iteration (s): 15.22 | learning rate: 1.299E-05 | global batch size: 16 | lm loss: 6.110389E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2479/ 128728 | consumed samples: 39664 | consumed tokens: 81231872 | elapsed time per iteration (s): 15.20 | learning rate: 1.300E-05 | global batch size: 16 | lm loss: 6.111573E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2480/ 128728 | consumed samples: 39680 | consumed tokens: 81264640 | elapsed time per iteration (s): 15.27 | learning rate: 1.300E-05 | global batch size: 16 | lm loss: 6.483891E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2481/ 128728 | consumed samples: 39696 | consumed tokens: 81297408 | elapsed time per iteration (s): 15.17 | learning rate: 1.301E-05 | global batch size: 16 | lm loss: 6.348729E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2482/ 128728 | consumed samples: 39712 | consumed tokens: 81330176 | elapsed time per iteration (s): 15.18 | learning rate: 
1.301E-05 | global batch size: 16 | lm loss: 6.445699E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2483/ 128728 | consumed samples: 39728 | consumed tokens: 81362944 | elapsed time per iteration (s): 15.23 | learning rate: 1.302E-05 | global batch size: 16 | lm loss: 6.384290E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2484/ 128728 | consumed samples: 39744 | consumed tokens: 81395712 | elapsed time per iteration (s): 15.20 | learning rate: 1.302E-05 | global batch size: 16 | lm loss: 6.514880E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2485/ 128728 | consumed samples: 39760 | consumed tokens: 81428480 | elapsed time per iteration (s): 15.18 | learning rate: 1.303E-05 | global batch size: 16 | lm loss: 6.243723E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2486/ 128728 | consumed samples: 39776 | consumed tokens: 81461248 | elapsed time per iteration (s): 15.24 | learning rate: 1.303E-05 | global batch size: 16 | lm loss: 6.220292E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2487/ 128728 | consumed samples: 39792 | consumed tokens: 81494016 | elapsed time per iteration (s): 15.22 | learning rate: 1.304E-05 | global batch size: 16 | lm loss: 6.380357E+00 | grad norm: 1.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2488/ 128728 | consumed 
samples: 39808 | consumed tokens: 81526784 | elapsed time per iteration (s): 15.23 | learning rate: 1.304E-05 | global batch size: 16 | lm loss: 6.065780E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2489/ 128728 | consumed samples: 39824 | consumed tokens: 81559552 | elapsed time per iteration (s): 15.26 | learning rate: 1.305E-05 | global batch size: 16 | lm loss: 6.013194E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2490/ 128728 | consumed samples: 39840 | consumed tokens: 81592320 | elapsed time per iteration (s): 15.18 | learning rate: 1.305E-05 | global batch size: 16 | lm loss: 6.132867E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2491/ 128728 | consumed samples: 39856 | consumed tokens: 81625088 | elapsed time per iteration (s): 15.21 | learning rate: 1.306E-05 | global batch size: 16 | lm loss: 6.028798E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2492/ 128728 | consumed samples: 39872 | consumed tokens: 81657856 | elapsed time per iteration (s): 15.17 | learning rate: 1.307E-05 | global batch size: 16 | lm loss: 6.127688E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2493/ 128728 | consumed samples: 39888 | consumed tokens: 81690624 | elapsed time per iteration (s): 15.27 | learning rate: 1.307E-05 | global batch size: 16 | lm loss: 6.248683E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2494/ 128728 | consumed samples: 39904 | consumed tokens: 81723392 | elapsed time per iteration (s): 15.23 | learning rate: 1.308E-05 | global batch size: 16 | lm loss: 6.398225E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2495/ 128728 | consumed samples: 39920 | consumed tokens: 81756160 | elapsed time per iteration (s): 15.21 | learning rate: 1.308E-05 | global batch size: 16 | lm loss: 6.293244E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2496/ 128728 | consumed samples: 39936 | consumed tokens: 81788928 | elapsed time per iteration (s): 15.24 | learning rate: 1.309E-05 | global batch size: 16 | lm loss: 6.195220E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2497/ 128728 | consumed samples: 39952 | consumed tokens: 81821696 | elapsed time per iteration (s): 15.23 | learning rate: 1.309E-05 | global batch size: 16 | lm loss: 6.319185E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2498/ 128728 | consumed samples: 39968 | consumed tokens: 81854464 | elapsed time per iteration (s): 15.23 | learning rate: 1.310E-05 | global batch size: 16 | lm loss: 6.121705E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2499/ 128728 | consumed samples: 39984 | consumed tokens: 81887232 | elapsed time per iteration (s): 15.28 | learning rate: 1.310E-05 | global batch size: 16 | lm loss: 
6.574756E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2500/ 128728 | consumed samples: 40000 | consumed tokens: 81920000 | elapsed time per iteration (s): 15.24 | learning rate: 1.311E-05 | global batch size: 16 | lm loss: 5.883192E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default0]:saving checkpoint at iteration 2500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 16:32:24,121] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/mp_rank_00_model_states.pt [default1]:[2022-03-03 16:32:24,906] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/mp_rank_01_model_states.pt [default1]:[2022-03-03 16:32:34,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default2]:[2022-03-03 16:32:35,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default7]:[2022-03-03 16:32:35,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default1]:[2022-03-03 16:32:36,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default0]:[2022-03-03 16:32:36,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default1]:[2022-03-03 16:32:36,382] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default7]:[2022-03-03 16:32:36,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default4]:[2022-03-03 16:32:36,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default5]:[2022-03-03 16:32:36,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default6]:[2022-03-03 16:32:36,381] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default3]:[2022-03-03 16:32:36,520] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default0]:[2022-03-03 16:32:36,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default1]:[2022-03-03 16:32:36,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default5]:[2022-03-03 16:32:37,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default4]:[2022-03-03 16:32:36,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default7]:[2022-03-03 16:32:37,114] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default4]:[2022-03-03 16:32:37,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default6]:[2022-03-03 16:32:37,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default2]:[2022-03-03 16:32:37,264] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default6]:[2022-03-03 16:32:37,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default2]:[2022-03-03 16:32:37,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default3]:[2022-03-03 16:32:37,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default0]:[2022-03-03 16:32:37,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default1]:[2022-03-03 16:32:37,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default0]:[2022-03-03 16:32:37,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default3]:[2022-03-03 16:32:37,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default5]:[2022-03-03 16:32:37,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default0]:[2022-03-03 16:32:38,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default5]:[2022-03-03 16:32:38,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default3]:[2022-03-03 16:32:38,645] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default6]:[2022-03-03 16:32:38,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default6]:[2022-03-03 16:32:38,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default0]:[2022-03-03 16:32:38,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default4]:[2022-03-03 16:32:38,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default7]:[2022-03-03 16:32:38,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default5]:[2022-03-03 16:32:38,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default1]:[2022-03-03 16:32:39,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default4]:[2022-03-03 16:32:39,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default0]:[2022-03-03 16:32:39,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default5]:[2022-03-03 16:32:39,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default3]:[2022-03-03 16:32:39,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default2]:[2022-03-03 16:32:39,191] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default0]:[2022-03-03 16:32:39,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default7]:[2022-03-03 16:32:39,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default3]:[2022-03-03 16:32:39,264] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default7]:[2022-03-03 16:32:39,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default4]:[2022-03-03 16:32:39,397] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default1]:[2022-03-03 16:32:39,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default2]:[2022-03-03 16:32:39,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default3]:[2022-03-03 16:32:39,438] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default6]:[2022-03-03 16:32:39,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default7]:[2022-03-03 16:32:39,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default5]:[2022-03-03 16:32:39,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default4]:[2022-03-03 16:32:39,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default6]:[2022-03-03 16:32:39,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default3]:[2022-03-03 16:32:39,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default5]:[2022-03-03 16:32:39,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default5]:[2022-03-03 16:32:39,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default5]:[2022-03-03 16:32:39,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default2]:[2022-03-03 16:32:39,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default1]:[2022-03-03 16:32:39,636] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default2]:[2022-03-03 16:32:39,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default3]:[2022-03-03 16:32:39,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default4]:[2022-03-03 16:32:39,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default5]:[2022-03-03 16:32:39,982] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default5]:[2022-03-03 16:32:40,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default4]:[2022-03-03 16:32:40,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default1]:[2022-03-03 16:32:40,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default4]:[2022-03-03 16:32:40,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default6]:[2022-03-03 16:32:40,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default3]:[2022-03-03 16:32:40,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default7]:[2022-03-03 16:32:40,350] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default2]:[2022-03-03 16:32:40,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default5]:[2022-03-03 16:32:40,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default2]:[2022-03-03 16:32:40,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default7]:[2022-03-03 16:32:40,584] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default1]:[2022-03-03 16:32:40,563] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default1]:[2022-03-03 16:32:40,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default6]:[2022-03-03 16:32:40,621] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default0]:[2022-03-03 16:32:40,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default1]:[2022-03-03 16:32:40,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default2]:[2022-03-03 16:32:40,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default2]:[2022-03-03 16:32:40,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default0]:[2022-03-03 16:32:40,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default1]:[2022-03-03 16:32:40,679] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default0]:[2022-03-03 16:32:40,696] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default1]:[2022-03-03 16:32:40,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default4]:[2022-03-03 16:32:40,796] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default5]:[2022-03-03 16:32:40,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default3]:[2022-03-03 16:32:40,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default6]:[2022-03-03 16:32:40,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default0]:[2022-03-03 16:32:40,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default1]:[2022-03-03 16:32:40,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default3]:[2022-03-03 16:32:40,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default4]:[2022-03-03 16:32:40,823] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default4]:[2022-03-03 16:32:40,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default5]:[2022-03-03 16:32:40,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default4]:[2022-03-03 16:32:40,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default6]:[2022-03-03 16:32:41,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default5]:[2022-03-03 16:32:40,937] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default4]:[2022-03-03 16:32:40,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default3]:[2022-03-03 16:32:41,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default2]:[2022-03-03 16:32:41,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default2]:[2022-03-03 16:32:41,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default0]:[2022-03-03 16:32:41,228] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default4]:[2022-03-03 16:32:41,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default7]:[2022-03-03 16:32:41,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default2]:[2022-03-03 16:32:41,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default4]:[2022-03-03 16:32:41,401] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default6]:[2022-03-03 16:32:41,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default0]:[2022-03-03 16:32:41,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default0]:[2022-03-03 16:32:41,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default0]:[2022-03-03 16:32:41,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default4]:[2022-03-03 16:32:41,495] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default7]:[2022-03-03 16:32:41,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default5]:[2022-03-03 16:32:41,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default1]:[2022-03-03 16:32:41,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default6]:[2022-03-03 16:32:41,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default2]:[2022-03-03 16:32:41,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default2]:[2022-03-03 16:32:41,656] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default3]:[2022-03-03 16:32:41,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default7]:[2022-03-03 16:32:41,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default3]:[2022-03-03 16:32:41,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default6]:[2022-03-03 16:32:41,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default4]:[2022-03-03 16:32:41,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default5]:[2022-03-03 16:32:41,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default7]:[2022-03-03 16:32:41,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default6]:[2022-03-03 16:32:41,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default2]:[2022-03-03 16:32:41,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default3]:[2022-03-03 16:32:41,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default7]:[2022-03-03 16:32:41,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default4]:[2022-03-03 16:32:41,815] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default3]:[2022-03-03 16:32:42,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default5]:[2022-03-03 16:32:42,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default3]:[2022-03-03 16:32:42,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default3]:[2022-03-03 16:32:42,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default2]:[2022-03-03 16:32:42,154] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default7]:[2022-03-03 16:32:42,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default1]:[2022-03-03 16:32:42,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default7]:[2022-03-03 16:32:42,499] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default1]:[2022-03-03 16:32:42,544] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default0]:[2022-03-03 16:32:42,469] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default6]:[2022-03-03 16:32:42,466] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default2]:[2022-03-03 16:32:42,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default0]:[2022-03-03 16:32:42,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default2]:[2022-03-03 16:32:42,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default5]:[2022-03-03 16:32:42,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default4]:[2022-03-03 16:32:42,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default2]:[2022-03-03 16:32:42,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default5]:[2022-03-03 16:32:42,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default1]:[2022-03-03 16:32:42,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default0]:[2022-03-03 16:32:42,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default0]:[2022-03-03 16:32:42,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default6]:[2022-03-03 16:32:42,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default2]:[2022-03-03 16:32:42,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default1]:[2022-03-03 16:32:42,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default2]:[2022-03-03 16:32:42,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default1]:[2022-03-03 16:32:42,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default7]:[2022-03-03 16:32:43,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default3]:[2022-03-03 16:32:43,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default0]:[2022-03-03 16:32:43,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default3]:[2022-03-03 16:32:42,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default4]:[2022-03-03 16:32:43,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default0]:[2022-03-03 16:32:43,088] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default0]:[2022-03-03 16:32:43,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default2]:[2022-03-03 16:32:43,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default7]:[2022-03-03 16:32:43,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default5]:[2022-03-03 16:32:43,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default1]:[2022-03-03 16:32:43,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default3]:[2022-03-03 16:32:43,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default3]:[2022-03-03 16:32:43,156] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default0]:[2022-03-03 16:32:43,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default7]:[2022-03-03 16:32:43,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default2]:[2022-03-03 16:32:43,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default4]:[2022-03-03 16:32:43,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default3]:[2022-03-03 16:32:43,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default5]:[2022-03-03 16:32:43,208] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default1]:[2022-03-03 16:32:43,291] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default6]:[2022-03-03 16:32:43,248] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default7]:[2022-03-03 16:32:43,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default6]:[2022-03-03 16:32:43,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default3]:[2022-03-03 16:32:43,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default7]:[2022-03-03 16:32:43,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default3]:[2022-03-03 16:32:43,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default5]:[2022-03-03 16:32:43,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default6]:[2022-03-03 16:32:43,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default6]:[2022-03-03 16:32:43,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default5]:[2022-03-03 16:32:43,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default2]:[2022-03-03 16:32:43,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default4]:[2022-03-03 16:32:43,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default3]:[2022-03-03 16:32:43,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default4]:[2022-03-03 16:32:43,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default5]:[2022-03-03 16:32:43,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default3]:[2022-03-03 16:32:43,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default3]:[2022-03-03 16:32:43,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default2]:[2022-03-03 16:32:43,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default3]:[2022-03-03 16:32:43,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default0]:[2022-03-03 16:32:43,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default1]:[2022-03-03 16:32:43,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default2]:[2022-03-03 16:32:43,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default6]:[2022-03-03 16:32:43,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default1]:[2022-03-03 16:32:43,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default4]:[2022-03-03 16:32:43,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default0]:[2022-03-03 16:32:43,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default0]:[2022-03-03 16:32:43,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default3]:[2022-03-03 16:32:43,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default1]:[2022-03-03 16:32:43,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default0]:[2022-03-03 16:32:43,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default1]:[2022-03-03 16:32:43,853] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default3]:[2022-03-03 16:32:43,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default0]:[2022-03-03 16:32:43,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default0]:[2022-03-03 16:32:43,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default1]:[2022-03-03 16:32:43,945] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default6]:[2022-03-03 16:32:44,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default1]:[2022-03-03 16:32:44,029] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default7]:[2022-03-03 16:32:44,030] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default5]:[2022-03-03 16:32:44,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default0]:[2022-03-03 16:32:44,061] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default7]:[2022-03-03 16:32:44,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default0]:[2022-03-03 16:32:44,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default1]:[2022-03-03 16:32:44,124] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default6]:[2022-03-03 16:32:44,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default7]:[2022-03-03 16:32:44,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default4]:[2022-03-03 16:32:44,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default2]:[2022-03-03 16:32:44,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default6]:[2022-03-03 16:32:44,365] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default7]:[2022-03-03 16:32:44,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default7]:[2022-03-03 16:32:44,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default6]:[2022-03-03 16:32:44,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default5]:[2022-03-03 16:32:44,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default7]:[2022-03-03 16:32:44,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default1]:[2022-03-03 16:32:44,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default2]:[2022-03-03 16:32:44,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default3]:[2022-03-03 16:32:44,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default5]:[2022-03-03 16:32:44,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default7]:[2022-03-03 16:32:44,676] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default4]:[2022-03-03 16:32:44,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default6]:[2022-03-03 16:32:44,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default6]:[2022-03-03 16:32:44,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default6]:[2022-03-03 16:32:44,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default4]:[2022-03-03 16:32:44,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default2]:[2022-03-03 16:32:44,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default4]:[2022-03-03 16:32:44,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default7]:[2022-03-03 16:32:44,715] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default0]:[2022-03-03 16:32:44,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default4]:[2022-03-03 16:32:44,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default6]:[2022-03-03 16:32:44,865] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default3]:[2022-03-03 16:32:44,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default3]:[2022-03-03 16:32:44,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default2]:[2022-03-03 16:32:44,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default5]:[2022-03-03 16:32:45,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default4]:[2022-03-03 16:32:45,095] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default5]:[2022-03-03 16:32:45,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default3]:[2022-03-03 16:32:45,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default5]:[2022-03-03 16:32:45,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default2]:[2022-03-03 16:32:45,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default1]:[2022-03-03 16:32:45,180] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default3]:[2022-03-03 16:32:45,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default7]:[2022-03-03 16:32:45,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default4]:[2022-03-03 16:32:45,202] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default7]:[2022-03-03 16:32:45,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default2]:[2022-03-03 16:32:45,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default5]:[2022-03-03 16:32:45,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default6]:[2022-03-03 16:32:45,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default5]:[2022-03-03 16:32:45,445] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default4]:[2022-03-03 16:32:45,463] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default2]:[2022-03-03 16:32:45,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default7]:[2022-03-03 16:32:45,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default1]:[2022-03-03 16:32:45,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default0]:[2022-03-03 16:32:45,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default2]:[2022-03-03 16:32:45,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default1]:[2022-03-03 16:32:45,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default7]:[2022-03-03 16:32:45,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default5]:[2022-03-03 16:32:45,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default6]:[2022-03-03 16:32:45,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default1]:[2022-03-03 16:32:45,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default7]:[2022-03-03 16:32:45,989] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default0]:[2022-03-03 16:32:46,072] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default2]:[2022-03-03 16:32:46,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default5]:[2022-03-03 16:32:46,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default2]:[2022-03-03 16:32:46,266] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default3]:[2022-03-03 16:32:46,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default1]:[2022-03-03 16:32:46,232] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default6]:[2022-03-03 16:32:46,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default3]:[2022-03-03 16:32:46,302] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default3]:[2022-03-03 16:32:46,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default3]:[2022-03-03 16:32:46,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default4]:[2022-03-03 16:32:46,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default3]:[2022-03-03 16:32:46,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default7]:[2022-03-03 16:32:46,305] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default7]:[2022-03-03 16:32:46,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default0]:[2022-03-03 16:32:46,232] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default0]:[2022-03-03 16:32:46,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default1]:[2022-03-03 16:32:46,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default6]:[2022-03-03 16:32:46,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default6]:[2022-03-03 16:32:46,372] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default5]:[2022-03-03 16:32:46,463] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default0]:[2022-03-03 16:32:46,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default3]:[2022-03-03 16:32:46,486] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default2]:[2022-03-03 16:32:46,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default7]:[2022-03-03 16:32:46,563] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default4]:[2022-03-03 16:32:46,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default1]:[2022-03-03 16:32:46,559] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default4]:[2022-03-03 16:32:46,584] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default0]:[2022-03-03 16:32:46,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default6]:[2022-03-03 16:32:46,629] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default7]:[2022-03-03 16:32:46,584] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default2]:[2022-03-03 16:32:46,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default3]:[2022-03-03 16:32:46,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default6]:[2022-03-03 16:32:46,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default2]:[2022-03-03 16:32:46,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default2]:[2022-03-03 16:32:46,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default6]:[2022-03-03 16:32:46,678] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default1]:[2022-03-03 16:32:46,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default2]:[2022-03-03 16:32:46,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default4]:[2022-03-03 16:32:46,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default1]:[2022-03-03 16:32:46,855] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default3]:[2022-03-03 16:32:46,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default0]:[2022-03-03 16:32:46,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default0]:[2022-03-03 16:32:46,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default4]:[2022-03-03 16:32:47,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default1]:[2022-03-03 16:32:46,970] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default0]:[2022-03-03 16:32:46,934] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default3]:[2022-03-03 16:32:46,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default2]:[2022-03-03 16:32:47,044] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default1]:[2022-03-03 16:32:46,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default5]:[2022-03-03 16:32:47,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default5]:[2022-03-03 16:32:47,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default7]:[2022-03-03 16:32:47,462] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default6]:[2022-03-03 16:32:47,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default2]:[2022-03-03 16:32:47,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default1]:[2022-03-03 16:32:47,629] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default0]:[2022-03-03 16:32:47,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default1]:[2022-03-03 16:32:47,708] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default5]:[2022-03-03 16:32:47,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default7]:[2022-03-03 16:32:47,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default7]:[2022-03-03 16:32:47,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default2]:[2022-03-03 16:32:47,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default7]:[2022-03-03 16:32:47,822] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default6]:[2022-03-03 16:32:47,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default6]:[2022-03-03 16:32:47,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default0]:[2022-03-03 16:32:47,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default0]:[2022-03-03 16:32:47,940] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default4]:[2022-03-03 16:32:48,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default6]:[2022-03-03 16:32:48,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default5]:[2022-03-03 16:32:48,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default4]:[2022-03-03 16:32:48,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default7]:[2022-03-03 16:32:48,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default7]:[2022-03-03 16:32:48,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default4]:[2022-03-03 16:32:48,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default6]:[2022-03-03 16:32:48,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default2]:[2022-03-03 16:32:48,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default4]:[2022-03-03 16:32:48,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default6]:[2022-03-03 16:32:48,746] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default7]:[2022-03-03 16:32:48,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default5]:[2022-03-03 16:32:48,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default6]:[2022-03-03 16:32:48,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default0]:[2022-03-03 16:32:48,993] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default7]:[2022-03-03 16:32:49,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default5]:[2022-03-03 16:32:49,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default3]:[2022-03-03 16:32:49,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default6]:[2022-03-03 16:32:49,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default1]:[2022-03-03 16:32:49,616] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default1]:[2022-03-03 16:32:49,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default6]:[2022-03-03 16:32:50,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default1]:[2022-03-03 16:32:50,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default0]:[2022-03-03 16:32:51,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default0]:[2022-03-03 16:32:52,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default2]:[2022-03-03 16:32:52,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default6]:[2022-03-03 16:32:52,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default7]:[2022-03-03 16:32:52,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default3]:[2022-03-03 16:32:52,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default4]:[2022-03-03 16:32:53,038] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default1]:[2022-03-03 16:32:53,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default5]:[2022-03-03 16:32:53,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default7]:[2022-03-03 16:32:53,272] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default4]:[2022-03-03 16:32:53,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default5]:[2022-03-03 16:32:53,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default4]:[2022-03-03 16:32:53,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default4]:[2022-03-03 16:32:53,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default4]:[2022-03-03 16:32:53,603] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default5]:[2022-03-03 16:32:53,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default5]:[2022-03-03 16:32:53,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default0]: successfully saved checkpoint at iteration 2500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]:time (ms) | save-checkpoint: 38545.52 [default5]:[2022-03-03 16:32:53,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step2500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default7]: iteration 2501/ 128728 | consumed samples: 40016 | consumed tokens: 81952768 | elapsed time per iteration (s): 53.79 | learning rate: 1.311E-05 | global batch size: 16 | lm loss: 6.252749E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.297 | TFLOPs: 2.28 | [default7]: iteration 2502/ 128728 | consumed samples: 40032 | consumed tokens: 81985536 | elapsed time per iteration (s): 15.24 | learning rate: 1.312E-05 | global batch size: 16 | lm loss: 5.942217E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2503/ 128728 | 
consumed samples: 40048 | consumed tokens: 82018304 | elapsed time per iteration (s): 15.22 | learning rate: 1.312E-05 | global batch size: 16 | lm loss: 6.333421E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2504/ 128728 | consumed samples: 40064 | consumed tokens: 82051072 | elapsed time per iteration (s): 15.24 | learning rate: 1.313E-05 | global batch size: 16 | lm loss: 6.306670E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2505/ 128728 | consumed samples: 40080 | consumed tokens: 82083840 | elapsed time per iteration (s): 15.24 | learning rate: 1.313E-05 | global batch size: 16 | lm loss: 6.183002E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2506/ 128728 | consumed samples: 40096 | consumed tokens: 82116608 | elapsed time per iteration (s): 15.25 | learning rate: 1.314E-05 | global batch size: 16 | lm loss: 6.207052E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2507/ 128728 | consumed samples: 40112 | consumed tokens: 82149376 | elapsed time per iteration (s): 15.25 | learning rate: 1.314E-05 | global batch size: 16 | lm loss: 6.162314E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2508/ 128728 | consumed samples: 40128 | consumed tokens: 82182144 | elapsed time per iteration (s): 15.25 | learning rate: 1.315E-05 | global batch size: 16 | lm loss: 6.242827E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2509/ 128728 | consumed samples: 40144 | consumed tokens: 82214912 | elapsed time per iteration (s): 15.21 | learning rate: 1.315E-05 | global batch size: 16 | lm loss: 6.144494E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2510/ 128728 | consumed samples: 40160 | consumed tokens: 82247680 | elapsed time per iteration (s): 15.21 | learning rate: 1.316E-05 | global batch size: 16 | lm loss: 6.119376E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2511/ 128728 | consumed samples: 40176 | consumed tokens: 82280448 | elapsed time per iteration (s): 15.18 | learning rate: 1.316E-05 | global batch size: 16 | lm loss: 6.218392E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2512/ 128728 | consumed samples: 40192 | consumed tokens: 82313216 | elapsed time per iteration (s): 15.19 | learning rate: 1.317E-05 | global batch size: 16 | lm loss: 6.246577E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2513/ 128728 | consumed samples: 40208 | consumed tokens: 82345984 | elapsed time per iteration (s): 15.23 | learning rate: 1.318E-05 | global batch size: 16 | lm loss: 6.041477E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2514/ 128728 | consumed samples: 40224 | consumed tokens: 82378752 | elapsed time per iteration (s): 15.17 | learning rate: 1.318E-05 | global batch size: 16 | lm loss: 
6.023715E+00 | grad norm: 0.856 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2515/ 128728 | consumed samples: 40240 | consumed tokens: 82411520 | elapsed time per iteration (s): 15.24 | learning rate: 1.319E-05 | global batch size: 16 | lm loss: 6.201522E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2516/ 128728 | consumed samples: 40256 | consumed tokens: 82444288 | elapsed time per iteration (s): 15.26 | learning rate: 1.319E-05 | global batch size: 16 | lm loss: 6.286212E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2517/ 128728 | consumed samples: 40272 | consumed tokens: 82477056 | elapsed time per iteration (s): 15.23 | learning rate: 1.320E-05 | global batch size: 16 | lm loss: 6.275428E+00 | grad norm: 1.433 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2518/ 128728 | consumed samples: 40288 | consumed tokens: 82509824 | elapsed time per iteration (s): 15.22 | learning rate: 1.320E-05 | global batch size: 16 | lm loss: 6.296135E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2519/ 128728 | consumed samples: 40304 | consumed tokens: 82542592 | elapsed time per iteration (s): 15.22 | learning rate: 1.321E-05 | global batch size: 16 | lm loss: 6.135454E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2520/ 128728 | consumed samples: 40320 | consumed tokens: 82575360 | 
elapsed time per iteration (s): 15.23 | learning rate: 1.321E-05 | global batch size: 16 | lm loss: 6.092723E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2521/ 128728 | consumed samples: 40336 | consumed tokens: 82608128 | elapsed time per iteration (s): 15.24 | learning rate: 1.322E-05 | global batch size: 16 | lm loss: 5.971050E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2522/ 128728 | consumed samples: 40352 | consumed tokens: 82640896 | elapsed time per iteration (s): 15.22 | learning rate: 1.322E-05 | global batch size: 16 | lm loss: 6.361732E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2523/ 128728 | consumed samples: 40368 | consumed tokens: 82673664 | elapsed time per iteration (s): 15.21 | learning rate: 1.323E-05 | global batch size: 16 | lm loss: 6.271525E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2524/ 128728 | consumed samples: 40384 | consumed tokens: 82706432 | elapsed time per iteration (s): 15.24 | learning rate: 1.323E-05 | global batch size: 16 | lm loss: 5.950109E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2525/ 128728 | consumed samples: 40400 | consumed tokens: 82739200 | elapsed time per iteration (s): 15.23 | learning rate: 1.324E-05 | global batch size: 16 | lm loss: 6.177237E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 2526/ 128728 | consumed samples: 40416 | consumed tokens: 82771968 | elapsed time per iteration (s): 15.21 | learning rate: 1.324E-05 | global batch size: 16 | lm loss: 6.452248E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2527/ 128728 | consumed samples: 40432 | consumed tokens: 82804736 | elapsed time per iteration (s): 15.23 | learning rate: 1.325E-05 | global batch size: 16 | lm loss: 6.125947E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2528/ 128728 | consumed samples: 40448 | consumed tokens: 82837504 | elapsed time per iteration (s): 15.23 | learning rate: 1.325E-05 | global batch size: 16 | lm loss: 6.275990E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2529/ 128728 | consumed samples: 40464 | consumed tokens: 82870272 | elapsed time per iteration (s): 15.18 | learning rate: 1.326E-05 | global batch size: 16 | lm loss: 6.224532E+00 | grad norm: 1.093 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2530/ 128728 | consumed samples: 40480 | consumed tokens: 82903040 | elapsed time per iteration (s): 15.22 | learning rate: 1.326E-05 | global batch size: 16 | lm loss: 6.188871E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2531/ 128728 | consumed samples: 40496 | consumed tokens: 82935808 | elapsed time per iteration (s): 15.19 | learning rate: 1.327E-05 | global batch size: 16 | lm loss: 6.316185E+00 | grad norm: 0.739 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2532/ 128728 | consumed samples: 40512 | consumed tokens: 82968576 | elapsed time per iteration (s): 15.17 | learning rate: 1.328E-05 | global batch size: 16 | lm loss: 6.173674E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2533/ 128728 | consumed samples: 40528 | consumed tokens: 83001344 | elapsed time per iteration (s): 15.22 | learning rate: 1.328E-05 | global batch size: 16 | lm loss: 6.066485E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2534/ 128728 | consumed samples: 40544 | consumed tokens: 83034112 | elapsed time per iteration (s): 15.23 | learning rate: 1.329E-05 | global batch size: 16 | lm loss: 5.854393E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2535/ 128728 | consumed samples: 40560 | consumed tokens: 83066880 | elapsed time per iteration (s): 15.22 | learning rate: 1.329E-05 | global batch size: 16 | lm loss: 6.168399E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2536/ 128728 | consumed samples: 40576 | consumed tokens: 83099648 | elapsed time per iteration (s): 15.21 | learning rate: 1.330E-05 | global batch size: 16 | lm loss: 6.101472E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2537/ 128728 | consumed samples: 40592 | consumed tokens: 83132416 | elapsed time per iteration (s): 15.20 | learning rate: 
1.330E-05 | global batch size: 16 | lm loss: 6.186237E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2538/ 128728 | consumed samples: 40608 | consumed tokens: 83165184 | elapsed time per iteration (s): 15.26 | learning rate: 1.331E-05 | global batch size: 16 | lm loss: 6.074104E+00 | grad norm: 0.902 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2539/ 128728 | consumed samples: 40624 | consumed tokens: 83197952 | elapsed time per iteration (s): 15.22 | learning rate: 1.331E-05 | global batch size: 16 | lm loss: 6.332224E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2540/ 128728 | consumed samples: 40640 | consumed tokens: 83230720 | elapsed time per iteration (s): 15.23 | learning rate: 1.332E-05 | global batch size: 16 | lm loss: 6.297553E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2541/ 128728 | consumed samples: 40656 | consumed tokens: 83263488 | elapsed time per iteration (s): 15.22 | learning rate: 1.332E-05 | global batch size: 16 | lm loss: 6.340025E+00 | grad norm: 1.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2542/ 128728 | consumed samples: 40672 | consumed tokens: 83296256 | elapsed time per iteration (s): 15.22 | learning rate: 1.333E-05 | global batch size: 16 | lm loss: 6.225410E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2543/ 128728 | consumed 
samples: 40688 | consumed tokens: 83329024 | elapsed time per iteration (s): 15.20 | learning rate: 1.333E-05 | global batch size: 16 | lm loss: 6.225701E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2544/ 128728 | consumed samples: 40704 | consumed tokens: 83361792 | elapsed time per iteration (s): 15.19 | learning rate: 1.334E-05 | global batch size: 16 | lm loss: 6.324197E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2545/ 128728 | consumed samples: 40720 | consumed tokens: 83394560 | elapsed time per iteration (s): 15.21 | learning rate: 1.334E-05 | global batch size: 16 | lm loss: 6.328512E+00 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2546/ 128728 | consumed samples: 40736 | consumed tokens: 83427328 | elapsed time per iteration (s): 15.15 | learning rate: 1.335E-05 | global batch size: 16 | lm loss: 6.272426E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2547/ 128728 | consumed samples: 40752 | consumed tokens: 83460096 | elapsed time per iteration (s): 15.21 | learning rate: 1.335E-05 | global batch size: 16 | lm loss: 6.139767E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2548/ 128728 | consumed samples: 40768 | consumed tokens: 83492864 | elapsed time per iteration (s): 15.19 | learning rate: 1.336E-05 | global batch size: 16 | lm loss: 5.918054E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2549/ 128728 | consumed samples: 40784 | consumed tokens: 83525632 | elapsed time per iteration (s): 15.20 | learning rate: 1.336E-05 | global batch size: 16 | lm loss: 6.227513E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2550/ 128728 | consumed samples: 40800 | consumed tokens: 83558400 | elapsed time per iteration (s): 15.20 | learning rate: 1.337E-05 | global batch size: 16 | lm loss: 6.322637E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2551/ 128728 | consumed samples: 40816 | consumed tokens: 83591168 | elapsed time per iteration (s): 15.20 | learning rate: 1.337E-05 | global batch size: 16 | lm loss: 6.055058E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2552/ 128728 | consumed samples: 40832 | consumed tokens: 83623936 | elapsed time per iteration (s): 15.20 | learning rate: 1.338E-05 | global batch size: 16 | lm loss: 6.212307E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2553/ 128728 | consumed samples: 40848 | consumed tokens: 83656704 | elapsed time per iteration (s): 15.16 | learning rate: 1.339E-05 | global batch size: 16 | lm loss: 6.109908E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2554/ 128728 | consumed samples: 40864 | consumed tokens: 83689472 | elapsed time per iteration (s): 15.21 | learning rate: 1.339E-05 | global batch size: 16 | lm loss: 
6.312997E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2555/ 128728 | consumed samples: 40880 | consumed tokens: 83722240 | elapsed time per iteration (s): 15.20 | learning rate: 1.340E-05 | global batch size: 16 | lm loss: 6.015300E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2556/ 128728 | consumed samples: 40896 | consumed tokens: 83755008 | elapsed time per iteration (s): 15.21 | learning rate: 1.340E-05 | global batch size: 16 | lm loss: 6.201309E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2557/ 128728 | consumed samples: 40912 | consumed tokens: 83787776 | elapsed time per iteration (s): 15.17 | learning rate: 1.341E-05 | global batch size: 16 | lm loss: 6.085812E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2558/ 128728 | consumed samples: 40928 | consumed tokens: 83820544 | elapsed time per iteration (s): 15.24 | learning rate: 1.341E-05 | global batch size: 16 | lm loss: 6.176955E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2559/ 128728 | consumed samples: 40944 | consumed tokens: 83853312 | elapsed time per iteration (s): 15.22 | learning rate: 1.342E-05 | global batch size: 16 | lm loss: 6.113753E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2560/ 128728 | consumed samples: 40960 | consumed tokens: 83886080 | 
elapsed time per iteration (s): 15.23 | learning rate: 1.342E-05 | global batch size: 16 | lm loss: 6.127898E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2561/ 128728 | consumed samples: 40976 | consumed tokens: 83918848 | elapsed time per iteration (s): 15.21 | learning rate: 1.343E-05 | global batch size: 16 | lm loss: 5.811277E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2562/ 128728 | consumed samples: 40992 | consumed tokens: 83951616 | elapsed time per iteration (s): 15.25 | learning rate: 1.343E-05 | global batch size: 16 | lm loss: 5.929381E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2563/ 128728 | consumed samples: 41008 | consumed tokens: 83984384 | elapsed time per iteration (s): 15.28 | learning rate: 1.344E-05 | global batch size: 16 | lm loss: 6.315426E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2564/ 128728 | consumed samples: 41024 | consumed tokens: 84017152 | elapsed time per iteration (s): 15.23 | learning rate: 1.344E-05 | global batch size: 16 | lm loss: 6.300920E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2565/ 128728 | consumed samples: 41040 | consumed tokens: 84049920 | elapsed time per iteration (s): 15.24 | learning rate: 1.345E-05 | global batch size: 16 | lm loss: 6.197340E+00 | grad norm: 1.105 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 2566/ 128728 | consumed samples: 41056 | consumed tokens: 84082688 | elapsed time per iteration (s): 15.21 | learning rate: 1.345E-05 | global batch size: 16 | lm loss: 6.466086E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2567/ 128728 | consumed samples: 41072 | consumed tokens: 84115456 | elapsed time per iteration (s): 15.21 | learning rate: 1.346E-05 | global batch size: 16 | lm loss: 6.310643E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2568/ 128728 | consumed samples: 41088 | consumed tokens: 84148224 | elapsed time per iteration (s): 15.22 | learning rate: 1.346E-05 | global batch size: 16 | lm loss: 5.983983E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2569/ 128728 | consumed samples: 41104 | consumed tokens: 84180992 | elapsed time per iteration (s): 15.25 | learning rate: 1.347E-05 | global batch size: 16 | lm loss: 6.170852E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2570/ 128728 | consumed samples: 41120 | consumed tokens: 84213760 | elapsed time per iteration (s): 15.22 | learning rate: 1.347E-05 | global batch size: 16 | lm loss: 6.011130E+00 | grad norm: 1.987 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2571/ 128728 | consumed samples: 41136 | consumed tokens: 84246528 | elapsed time per iteration (s): 15.28 | learning rate: 1.348E-05 | global batch size: 16 | lm loss: 6.224883E+00 | grad norm: 0.715 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2572/ 128728 | consumed samples: 41152 | consumed tokens: 84279296 | elapsed time per iteration (s): 15.23 | learning rate: 1.348E-05 | global batch size: 16 | lm loss: 6.098954E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2573/ 128728 | consumed samples: 41168 | consumed tokens: 84312064 | elapsed time per iteration (s): 15.24 | learning rate: 1.349E-05 | global batch size: 16 | lm loss: 6.223440E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2574/ 128728 | consumed samples: 41184 | consumed tokens: 84344832 | elapsed time per iteration (s): 15.23 | learning rate: 1.350E-05 | global batch size: 16 | lm loss: 6.194892E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2575/ 128728 | consumed samples: 41200 | consumed tokens: 84377600 | elapsed time per iteration (s): 15.25 | learning rate: 1.350E-05 | global batch size: 16 | lm loss: 5.966484E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2576/ 128728 | consumed samples: 41216 | consumed tokens: 84410368 | elapsed time per iteration (s): 15.17 | learning rate: 1.351E-05 | global batch size: 16 | lm loss: 6.128808E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2577/ 128728 | consumed samples: 41232 | consumed tokens: 84443136 | elapsed time per iteration (s): 15.22 | learning rate: 
1.351E-05 | global batch size: 16 | lm loss: 6.020306E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2578/ 128728 | consumed samples: 41248 | consumed tokens: 84475904 | elapsed time per iteration (s): 15.24 | learning rate: 1.352E-05 | global batch size: 16 | lm loss: 6.008765E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2579/ 128728 | consumed samples: 41264 | consumed tokens: 84508672 | elapsed time per iteration (s): 15.23 | learning rate: 1.352E-05 | global batch size: 16 | lm loss: 6.182366E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2580/ 128728 | consumed samples: 41280 | consumed tokens: 84541440 | elapsed time per iteration (s): 15.19 | learning rate: 1.353E-05 | global batch size: 16 | lm loss: 6.212614E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2581/ 128728 | consumed samples: 41296 | consumed tokens: 84574208 | elapsed time per iteration (s): 15.20 | learning rate: 1.353E-05 | global batch size: 16 | lm loss: 6.101485E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2582/ 128728 | consumed samples: 41312 | consumed tokens: 84606976 | elapsed time per iteration (s): 15.28 | learning rate: 1.354E-05 | global batch size: 16 | lm loss: 5.973782E+00 | grad norm: 1.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2583/ 128728 | consumed 
samples: 41328 | consumed tokens: 84639744 | elapsed time per iteration (s): 15.15 | learning rate: 1.354E-05 | global batch size: 16 | lm loss: 6.166084E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2584/ 128728 | consumed samples: 41344 | consumed tokens: 84672512 | elapsed time per iteration (s): 15.62 | learning rate: 1.355E-05 | global batch size: 16 | lm loss: 6.170146E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.024 | TFLOPs: 7.84 | [default7]: iteration 2585/ 128728 | consumed samples: 41360 | consumed tokens: 84705280 | elapsed time per iteration (s): 15.23 | learning rate: 1.355E-05 | global batch size: 16 | lm loss: 6.140182E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2586/ 128728 | consumed samples: 41376 | consumed tokens: 84738048 | elapsed time per iteration (s): 15.19 | learning rate: 1.356E-05 | global batch size: 16 | lm loss: 6.219534E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2587/ 128728 | consumed samples: 41392 | consumed tokens: 84770816 | elapsed time per iteration (s): 15.23 | learning rate: 1.356E-05 | global batch size: 16 | lm loss: 6.216126E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2588/ 128728 | consumed samples: 41408 | consumed tokens: 84803584 | elapsed time per iteration (s): 15.26 | learning rate: 1.357E-05 | global batch size: 16 | lm loss: 6.328304E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2589/ 128728 | consumed samples: 41424 | consumed tokens: 84836352 | elapsed time per iteration (s): 14.78 | learning rate: 1.357E-05 | global batch size: 16 | lm loss: 6.339481E+00 | grad norm: 7.573 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.082 | TFLOPs: 8.29 | [default7]: iteration 2590/ 128728 | consumed samples: 41440 | consumed tokens: 84869120 | elapsed time per iteration (s): 15.68 | learning rate: 1.358E-05 | global batch size: 16 | lm loss: 6.163667E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.020 | TFLOPs: 7.81 | [default7]: iteration 2591/ 128728 | consumed samples: 41456 | consumed tokens: 84901888 | elapsed time per iteration (s): 15.23 | learning rate: 1.358E-05 | global batch size: 16 | lm loss: 6.277731E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2592/ 128728 | consumed samples: 41472 | consumed tokens: 84934656 | elapsed time per iteration (s): 15.20 | learning rate: 1.359E-05 | global batch size: 16 | lm loss: 6.341899E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2593/ 128728 | consumed samples: 41488 | consumed tokens: 84967424 | elapsed time per iteration (s): 15.21 | learning rate: 1.359E-05 | global batch size: 16 | lm loss: 6.278369E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2594/ 128728 | consumed samples: 41504 | consumed tokens: 85000192 | elapsed time per iteration (s): 15.18 | learning rate: 1.360E-05 | global batch size: 16 | lm loss: 
6.110337E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2595/ 128728 | consumed samples: 41520 | consumed tokens: 85032960 | elapsed time per iteration (s): 15.15 | learning rate: 1.361E-05 | global batch size: 16 | lm loss: 6.144503E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2596/ 128728 | consumed samples: 41536 | consumed tokens: 85065728 | elapsed time per iteration (s): 15.23 | learning rate: 1.361E-05 | global batch size: 16 | lm loss: 5.943759E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2597/ 128728 | consumed samples: 41552 | consumed tokens: 85098496 | elapsed time per iteration (s): 15.04 | learning rate: 1.362E-05 | global batch size: 16 | lm loss: 6.147716E+00 | grad norm: 1.372 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.064 | TFLOPs: 8.14 | [default7]: iteration 2598/ 128728 | consumed samples: 41568 | consumed tokens: 85131264 | elapsed time per iteration (s): 14.94 | learning rate: 1.362E-05 | global batch size: 16 | lm loss: 6.098436E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.071 | TFLOPs: 8.20 | [default7]: iteration 2599/ 128728 | consumed samples: 41584 | consumed tokens: 85164032 | elapsed time per iteration (s): 15.26 | learning rate: 1.363E-05 | global batch size: 16 | lm loss: 6.340265E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2600/ 128728 | consumed samples: 41600 | consumed tokens: 85196800 | 
elapsed time per iteration (s): 15.19 | learning rate: 1.363E-05 | global batch size: 16 | lm loss: 6.089229E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2601/ 128728 | consumed samples: 41616 | consumed tokens: 85229568 | elapsed time per iteration (s): 15.26 | learning rate: 1.364E-05 | global batch size: 16 | lm loss: 6.258206E+00 | grad norm: 0.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2602/ 128728 | consumed samples: 41632 | consumed tokens: 85262336 | elapsed time per iteration (s): 15.24 | learning rate: 1.364E-05 | global batch size: 16 | lm loss: 6.186719E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2603/ 128728 | consumed samples: 41648 | consumed tokens: 85295104 | elapsed time per iteration (s): 15.21 | learning rate: 1.365E-05 | global batch size: 16 | lm loss: 6.095049E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2604/ 128728 | consumed samples: 41664 | consumed tokens: 85327872 | elapsed time per iteration (s): 15.23 | learning rate: 1.365E-05 | global batch size: 16 | lm loss: 6.124999E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2605/ 128728 | consumed samples: 41680 | consumed tokens: 85360640 | elapsed time per iteration (s): 15.23 | learning rate: 1.366E-05 | global batch size: 16 | lm loss: 5.955814E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.04 | [default7]: iteration 2606/ 128728 | consumed samples: 41696 | consumed tokens: 85393408 | elapsed time per iteration (s): 15.22 | learning rate: 1.366E-05 | global batch size: 16 | lm loss: 5.977965E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2607/ 128728 | consumed samples: 41712 | consumed tokens: 85426176 | elapsed time per iteration (s): 15.22 | learning rate: 1.367E-05 | global batch size: 16 | lm loss: 6.389388E+00 | grad norm: 0.888 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2608/ 128728 | consumed samples: 41728 | consumed tokens: 85458944 | elapsed time per iteration (s): 15.13 | learning rate: 1.367E-05 | global batch size: 16 | lm loss: 5.978179E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 2609/ 128728 | consumed samples: 41744 | consumed tokens: 85491712 | elapsed time per iteration (s): 15.16 | learning rate: 1.368E-05 | global batch size: 16 | lm loss: 6.004305E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2610/ 128728 | consumed samples: 41760 | consumed tokens: 85524480 | elapsed time per iteration (s): 15.21 | learning rate: 1.368E-05 | global batch size: 16 | lm loss: 6.251733E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2611/ 128728 | consumed samples: 41776 | consumed tokens: 85557248 | elapsed time per iteration (s): 15.21 | learning rate: 1.369E-05 | global batch size: 16 | lm loss: 6.239475E+00 | grad norm: 0.725 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2612/ 128728 | consumed samples: 41792 | consumed tokens: 85590016 | elapsed time per iteration (s): 15.23 | learning rate: 1.369E-05 | global batch size: 16 | lm loss: 6.155906E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2613/ 128728 | consumed samples: 41808 | consumed tokens: 85622784 | elapsed time per iteration (s): 15.22 | learning rate: 1.370E-05 | global batch size: 16 | lm loss: 6.004362E+00 | grad norm: 1.437 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2614/ 128728 | consumed samples: 41824 | consumed tokens: 85655552 | elapsed time per iteration (s): 15.23 | learning rate: 1.370E-05 | global batch size: 16 | lm loss: 6.035411E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2615/ 128728 | consumed samples: 41840 | consumed tokens: 85688320 | elapsed time per iteration (s): 15.22 | learning rate: 1.371E-05 | global batch size: 16 | lm loss: 6.144122E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2616/ 128728 | consumed samples: 41856 | consumed tokens: 85721088 | elapsed time per iteration (s): 15.19 | learning rate: 1.372E-05 | global batch size: 16 | lm loss: 5.992271E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2617/ 128728 | consumed samples: 41872 | consumed tokens: 85753856 | elapsed time per iteration (s): 15.15 | learning rate: 
1.372E-05 | global batch size: 16 | lm loss: 6.198568E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2618/ 128728 | consumed samples: 41888 | consumed tokens: 85786624 | elapsed time per iteration (s): 15.26 | learning rate: 1.373E-05 | global batch size: 16 | lm loss: 6.315387E+00 | grad norm: 1.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2619/ 128728 | consumed samples: 41904 | consumed tokens: 85819392 | elapsed time per iteration (s): 15.21 | learning rate: 1.373E-05 | global batch size: 16 | lm loss: 6.252526E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2620/ 128728 | consumed samples: 41920 | consumed tokens: 85852160 | elapsed time per iteration (s): 15.26 | learning rate: 1.374E-05 | global batch size: 16 | lm loss: 6.126781E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2621/ 128728 | consumed samples: 41936 | consumed tokens: 85884928 | elapsed time per iteration (s): 15.24 | learning rate: 1.374E-05 | global batch size: 16 | lm loss: 6.050614E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2622/ 128728 | consumed samples: 41952 | consumed tokens: 85917696 | elapsed time per iteration (s): 15.22 | learning rate: 1.375E-05 | global batch size: 16 | lm loss: 6.248853E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2623/ 128728 | consumed 
samples: 41968 | consumed tokens: 85950464 | elapsed time per iteration (s): 15.24 | learning rate: 1.375E-05 | global batch size: 16 | lm loss: 5.849868E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2624/ 128728 | consumed samples: 41984 | consumed tokens: 85983232 | elapsed time per iteration (s): 15.21 | learning rate: 1.376E-05 | global batch size: 16 | lm loss: 6.024261E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2625/ 128728 | consumed samples: 42000 | consumed tokens: 86016000 | elapsed time per iteration (s): 15.21 | learning rate: 1.376E-05 | global batch size: 16 | lm loss: 6.284721E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2626/ 128728 | consumed samples: 42016 | consumed tokens: 86048768 | elapsed time per iteration (s): 15.21 | learning rate: 1.377E-05 | global batch size: 16 | lm loss: 6.214346E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2627/ 128728 | consumed samples: 42032 | consumed tokens: 86081536 | elapsed time per iteration (s): 15.23 | learning rate: 1.377E-05 | global batch size: 16 | lm loss: 6.019969E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2628/ 128728 | consumed samples: 42048 | consumed tokens: 86114304 | elapsed time per iteration (s): 15.23 | learning rate: 1.378E-05 | global batch size: 16 | lm loss: 6.116952E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2629/ 128728 | consumed samples: 42064 | consumed tokens: 86147072 | elapsed time per iteration (s): 15.28 | learning rate: 1.378E-05 | global batch size: 16 | lm loss: 6.207554E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2630/ 128728 | consumed samples: 42080 | consumed tokens: 86179840 | elapsed time per iteration (s): 15.17 | learning rate: 1.379E-05 | global batch size: 16 | lm loss: 6.012637E+00 | grad norm: 0.908 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2631/ 128728 | consumed samples: 42096 | consumed tokens: 86212608 | elapsed time per iteration (s): 15.21 | learning rate: 1.379E-05 | global batch size: 16 | lm loss: 6.151033E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2632/ 128728 | consumed samples: 42112 | consumed tokens: 86245376 | elapsed time per iteration (s): 15.25 | learning rate: 1.380E-05 | global batch size: 16 | lm loss: 5.952607E+00 | grad norm: 1.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2633/ 128728 | consumed samples: 42128 | consumed tokens: 86278144 | elapsed time per iteration (s): 15.21 | learning rate: 1.380E-05 | global batch size: 16 | lm loss: 6.403392E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2634/ 128728 | consumed samples: 42144 | consumed tokens: 86310912 | elapsed time per iteration (s): 15.24 | learning rate: 1.381E-05 | global batch size: 16 | lm loss: 
6.004301E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2635/ 128728 | consumed samples: 42160 | consumed tokens: 86343680 | elapsed time per iteration (s): 15.23 | learning rate: 1.382E-05 | global batch size: 16 | lm loss: 6.076480E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2636/ 128728 | consumed samples: 42176 | consumed tokens: 86376448 | elapsed time per iteration (s): 15.20 | learning rate: 1.382E-05 | global batch size: 16 | lm loss: 5.914468E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2637/ 128728 | consumed samples: 42192 | consumed tokens: 86409216 | elapsed time per iteration (s): 15.24 | learning rate: 1.383E-05 | global batch size: 16 | lm loss: 6.257906E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2638/ 128728 | consumed samples: 42208 | consumed tokens: 86441984 | elapsed time per iteration (s): 15.23 | learning rate: 1.383E-05 | global batch size: 16 | lm loss: 6.093236E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2639/ 128728 | consumed samples: 42224 | consumed tokens: 86474752 | elapsed time per iteration (s): 15.23 | learning rate: 1.384E-05 | global batch size: 16 | lm loss: 6.195016E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2640/ 128728 | consumed samples: 42240 | consumed tokens: 86507520 | 
elapsed time per iteration (s): 15.22 | learning rate: 1.384E-05 | global batch size: 16 | lm loss: 6.247684E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2641/ 128728 | consumed samples: 42256 | consumed tokens: 86540288 | elapsed time per iteration (s): 15.24 | learning rate: 1.385E-05 | global batch size: 16 | lm loss: 6.176681E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2642/ 128728 | consumed samples: 42272 | consumed tokens: 86573056 | elapsed time per iteration (s): 15.16 | learning rate: 1.385E-05 | global batch size: 16 | lm loss: 6.208982E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2643/ 128728 | consumed samples: 42288 | consumed tokens: 86605824 | elapsed time per iteration (s): 15.21 | learning rate: 1.386E-05 | global batch size: 16 | lm loss: 5.945809E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2644/ 128728 | consumed samples: 42304 | consumed tokens: 86638592 | elapsed time per iteration (s): 15.24 | learning rate: 1.386E-05 | global batch size: 16 | lm loss: 6.031917E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2645/ 128728 | consumed samples: 42320 | consumed tokens: 86671360 | elapsed time per iteration (s): 15.24 | learning rate: 1.387E-05 | global batch size: 16 | lm loss: 6.110291E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 2646/ 128728 | consumed samples: 42336 | consumed tokens: 86704128 | elapsed time per iteration (s): 15.20 | learning rate: 1.387E-05 | global batch size: 16 | lm loss: 6.099847E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2647/ 128728 | consumed samples: 42352 | consumed tokens: 86736896 | elapsed time per iteration (s): 15.21 | learning rate: 1.388E-05 | global batch size: 16 | lm loss: 5.954161E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2648/ 128728 | consumed samples: 42368 | consumed tokens: 86769664 | elapsed time per iteration (s): 15.20 | learning rate: 1.388E-05 | global batch size: 16 | lm loss: 6.157164E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2649/ 128728 | consumed samples: 42384 | consumed tokens: 86802432 | elapsed time per iteration (s): 15.19 | learning rate: 1.389E-05 | global batch size: 16 | lm loss: 6.200693E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2650/ 128728 | consumed samples: 42400 | consumed tokens: 86835200 | elapsed time per iteration (s): 15.19 | learning rate: 1.389E-05 | global batch size: 16 | lm loss: 6.027765E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2651/ 128728 | consumed samples: 42416 | consumed tokens: 86867968 | elapsed time per iteration (s): 15.21 | learning rate: 1.390E-05 | global batch size: 16 | lm loss: 6.053953E+00 | grad norm: 0.699 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2652/ 128728 | consumed samples: 42432 | consumed tokens: 86900736 | elapsed time per iteration (s): 15.22 | learning rate: 1.390E-05 | global batch size: 16 | lm loss: 6.035723E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2653/ 128728 | consumed samples: 42448 | consumed tokens: 86933504 | elapsed time per iteration (s): 15.24 | learning rate: 1.391E-05 | global batch size: 16 | lm loss: 6.109938E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2654/ 128728 | consumed samples: 42464 | consumed tokens: 86966272 | elapsed time per iteration (s): 15.24 | learning rate: 1.391E-05 | global batch size: 16 | lm loss: 6.273432E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2655/ 128728 | consumed samples: 42480 | consumed tokens: 86999040 | elapsed time per iteration (s): 15.23 | learning rate: 1.392E-05 | global batch size: 16 | lm loss: 6.122913E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2656/ 128728 | consumed samples: 42496 | consumed tokens: 87031808 | elapsed time per iteration (s): 15.21 | learning rate: 1.393E-05 | global batch size: 16 | lm loss: 6.010287E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2657/ 128728 | consumed samples: 42512 | consumed tokens: 87064576 | elapsed time per iteration (s): 15.26 | learning rate: 
1.393E-05 | global batch size: 16 | lm loss: 6.254018E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2658/ 128728 | consumed samples: 42528 | consumed tokens: 87097344 | elapsed time per iteration (s): 15.21 | learning rate: 1.394E-05 | global batch size: 16 | lm loss: 6.063126E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2659/ 128728 | consumed samples: 42544 | consumed tokens: 87130112 | elapsed time per iteration (s): 15.25 | learning rate: 1.394E-05 | global batch size: 16 | lm loss: 6.102057E+00 | grad norm: 1.325 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2660/ 128728 | consumed samples: 42560 | consumed tokens: 87162880 | elapsed time per iteration (s): 15.17 | learning rate: 1.395E-05 | global batch size: 16 | lm loss: 5.900523E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2661/ 128728 | consumed samples: 42576 | consumed tokens: 87195648 | elapsed time per iteration (s): 15.27 | learning rate: 1.395E-05 | global batch size: 16 | lm loss: 6.192137E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2662/ 128728 | consumed samples: 42592 | consumed tokens: 87228416 | elapsed time per iteration (s): 15.19 | learning rate: 1.396E-05 | global batch size: 16 | lm loss: 6.078798E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2663/ 128728 | consumed 
samples: 42608 | consumed tokens: 87261184 | elapsed time per iteration (s): 15.14 | learning rate: 1.396E-05 | global batch size: 16 | lm loss: 6.099391E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2664/ 128728 | consumed samples: 42624 | consumed tokens: 87293952 | elapsed time per iteration (s): 15.22 | learning rate: 1.397E-05 | global batch size: 16 | lm loss: 6.029523E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2665/ 128728 | consumed samples: 42640 | consumed tokens: 87326720 | elapsed time per iteration (s): 15.27 | learning rate: 1.397E-05 | global batch size: 16 | lm loss: 5.808035E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2666/ 128728 | consumed samples: 42656 | consumed tokens: 87359488 | elapsed time per iteration (s): 15.24 | learning rate: 1.398E-05 | global batch size: 16 | lm loss: 6.260103E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2667/ 128728 | consumed samples: 42672 | consumed tokens: 87392256 | elapsed time per iteration (s): 15.18 | learning rate: 1.398E-05 | global batch size: 16 | lm loss: 6.125736E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2668/ 128728 | consumed samples: 42688 | consumed tokens: 87425024 | elapsed time per iteration (s): 15.23 | learning rate: 1.399E-05 | global batch size: 16 | lm loss: 5.999493E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2669/ 128728 | consumed samples: 42704 | consumed tokens: 87457792 | elapsed time per iteration (s): 15.23 | learning rate: 1.399E-05 | global batch size: 16 | lm loss: 6.127405E+00 | grad norm: 0.638 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2670/ 128728 | consumed samples: 42720 | consumed tokens: 87490560 | elapsed time per iteration (s): 15.26 | learning rate: 1.400E-05 | global batch size: 16 | lm loss: 6.203554E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2671/ 128728 | consumed samples: 42736 | consumed tokens: 87523328 | elapsed time per iteration (s): 15.26 | learning rate: 1.400E-05 | global batch size: 16 | lm loss: 6.156468E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2672/ 128728 | consumed samples: 42752 | consumed tokens: 87556096 | elapsed time per iteration (s): 15.21 | learning rate: 1.401E-05 | global batch size: 16 | lm loss: 6.088578E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2673/ 128728 | consumed samples: 42768 | consumed tokens: 87588864 | elapsed time per iteration (s): 15.22 | learning rate: 1.401E-05 | global batch size: 16 | lm loss: 6.113354E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2674/ 128728 | consumed samples: 42784 | consumed tokens: 87621632 | elapsed time per iteration (s): 15.27 | learning rate: 1.402E-05 | global batch size: 16 | lm loss: 
6.172616E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2675/ 128728 | consumed samples: 42800 | consumed tokens: 87654400 | elapsed time per iteration (s): 15.22 | learning rate: 1.402E-05 | global batch size: 16 | lm loss: 6.198242E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2676/ 128728 | consumed samples: 42816 | consumed tokens: 87687168 | elapsed time per iteration (s): 15.25 | learning rate: 1.403E-05 | global batch size: 16 | lm loss: 5.941981E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2677/ 128728 | consumed samples: 42832 | consumed tokens: 87719936 | elapsed time per iteration (s): 15.23 | learning rate: 1.404E-05 | global batch size: 16 | lm loss: 5.984716E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2678/ 128728 | consumed samples: 42848 | consumed tokens: 87752704 | elapsed time per iteration (s): 15.22 | learning rate: 1.404E-05 | global batch size: 16 | lm loss: 6.288304E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2679/ 128728 | consumed samples: 42864 | consumed tokens: 87785472 | elapsed time per iteration (s): 15.26 | learning rate: 1.405E-05 | global batch size: 16 | lm loss: 5.836905E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2680/ 128728 | consumed samples: 42880 | consumed tokens: 87818240 | 
elapsed time per iteration (s): 15.22 | learning rate: 1.405E-05 | global batch size: 16 | lm loss: 5.946983E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2681/ 128728 | consumed samples: 42896 | consumed tokens: 87851008 | elapsed time per iteration (s): 15.26 | learning rate: 1.406E-05 | global batch size: 16 | lm loss: 5.952541E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2682/ 128728 | consumed samples: 42912 | consumed tokens: 87883776 | elapsed time per iteration (s): 15.20 | learning rate: 1.406E-05 | global batch size: 16 | lm loss: 6.209697E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2683/ 128728 | consumed samples: 42928 | consumed tokens: 87916544 | elapsed time per iteration (s): 15.22 | learning rate: 1.407E-05 | global batch size: 16 | lm loss: 6.055411E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2684/ 128728 | consumed samples: 42944 | consumed tokens: 87949312 | elapsed time per iteration (s): 15.18 | learning rate: 1.407E-05 | global batch size: 16 | lm loss: 6.116272E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2685/ 128728 | consumed samples: 42960 | consumed tokens: 87982080 | elapsed time per iteration (s): 15.20 | learning rate: 1.408E-05 | global batch size: 16 | lm loss: 6.151689E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 
8.06 | [default7]: iteration 2686/ 128728 | consumed samples: 42976 | consumed tokens: 88014848 | elapsed time per iteration (s): 15.21 | learning rate: 1.408E-05 | global batch size: 16 | lm loss: 6.154226E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2687/ 128728 | consumed samples: 42992 | consumed tokens: 88047616 | elapsed time per iteration (s): 15.26 | learning rate: 1.409E-05 | global batch size: 16 | lm loss: 5.946739E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2688/ 128728 | consumed samples: 43008 | consumed tokens: 88080384 | elapsed time per iteration (s): 15.23 | learning rate: 1.409E-05 | global batch size: 16 | lm loss: 5.993872E+00 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2689/ 128728 | consumed samples: 43024 | consumed tokens: 88113152 | elapsed time per iteration (s): 15.25 | learning rate: 1.410E-05 | global batch size: 16 | lm loss: 6.235291E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 2690/ 128728 | consumed samples: 43040 | consumed tokens: 88145920 | elapsed time per iteration (s): 15.23 | learning rate: 1.410E-05 | global batch size: 16 | lm loss: 6.016863E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2691/ 128728 | consumed samples: 43056 | consumed tokens: 88178688 | elapsed time per iteration (s): 15.21 | learning rate: 1.411E-05 | global batch size: 16 | lm loss: 6.055921E+00 | grad norm: 0.923 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2692/ 128728 | consumed samples: 43072 | consumed tokens: 88211456 | elapsed time per iteration (s): 15.29 | learning rate: 1.411E-05 | global batch size: 16 | lm loss: 6.147404E+00 | grad norm: 1.075 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 2693/ 128728 | consumed samples: 43088 | consumed tokens: 88244224 | elapsed time per iteration (s): 15.25 | learning rate: 1.412E-05 | global batch size: 16 | lm loss: 6.023476E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2694/ 128728 | consumed samples: 43104 | consumed tokens: 88276992 | elapsed time per iteration (s): 15.21 | learning rate: 1.412E-05 | global batch size: 16 | lm loss: 6.106614E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2695/ 128728 | consumed samples: 43120 | consumed tokens: 88309760 | elapsed time per iteration (s): 15.19 | learning rate: 1.413E-05 | global batch size: 16 | lm loss: 6.147112E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2696/ 128728 | consumed samples: 43136 | consumed tokens: 88342528 | elapsed time per iteration (s): 15.26 | learning rate: 1.413E-05 | global batch size: 16 | lm loss: 6.314603E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2697/ 128728 | consumed samples: 43152 | consumed tokens: 88375296 | elapsed time per iteration (s): 15.22 | learning rate: 
1.414E-05 | global batch size: 16 | lm loss: 6.222948E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2698/ 128728 | consumed samples: 43168 | consumed tokens: 88408064 | elapsed time per iteration (s): 15.22 | learning rate: 1.415E-05 | global batch size: 16 | lm loss: 6.301141E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2699/ 128728 | consumed samples: 43184 | consumed tokens: 88440832 | elapsed time per iteration (s): 15.21 | learning rate: 1.415E-05 | global batch size: 16 | lm loss: 6.171608E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2700/ 128728 | consumed samples: 43200 | consumed tokens: 88473600 | elapsed time per iteration (s): 15.23 | learning rate: 1.416E-05 | global batch size: 16 | lm loss: 6.046288E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2701/ 128728 | consumed samples: 43216 | consumed tokens: 88506368 | elapsed time per iteration (s): 15.23 | learning rate: 1.416E-05 | global batch size: 16 | lm loss: 6.242400E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2702/ 128728 | consumed samples: 43232 | consumed tokens: 88539136 | elapsed time per iteration (s): 15.18 | learning rate: 1.417E-05 | global batch size: 16 | lm loss: 6.125425E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2703/ 128728 | consumed 
samples: 43248 | consumed tokens: 88571904 | elapsed time per iteration (s): 15.22 | learning rate: 1.417E-05 | global batch size: 16 | lm loss: 6.006852E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2704/ 128728 | consumed samples: 43264 | consumed tokens: 88604672 | elapsed time per iteration (s): 15.23 | learning rate: 1.418E-05 | global batch size: 16 | lm loss: 5.967431E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2705/ 128728 | consumed samples: 43280 | consumed tokens: 88637440 | elapsed time per iteration (s): 15.19 | learning rate: 1.418E-05 | global batch size: 16 | lm loss: 5.898385E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2706/ 128728 | consumed samples: 43296 | consumed tokens: 88670208 | elapsed time per iteration (s): 15.25 | learning rate: 1.419E-05 | global batch size: 16 | lm loss: 6.021958E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2707/ 128728 | consumed samples: 43312 | consumed tokens: 88702976 | elapsed time per iteration (s): 15.20 | learning rate: 1.419E-05 | global batch size: 16 | lm loss: 6.094849E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2708/ 128728 | consumed samples: 43328 | consumed tokens: 88735744 | elapsed time per iteration (s): 15.22 | learning rate: 1.420E-05 | global batch size: 16 | lm loss: 6.081100E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2709/ 128728 | consumed samples: 43344 | consumed tokens: 88768512 | elapsed time per iteration (s): 15.22 | learning rate: 1.420E-05 | global batch size: 16 | lm loss: 6.196400E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2710/ 128728 | consumed samples: 43360 | consumed tokens: 88801280 | elapsed time per iteration (s): 15.21 | learning rate: 1.421E-05 | global batch size: 16 | lm loss: 5.977609E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2711/ 128728 | consumed samples: 43376 | consumed tokens: 88834048 | elapsed time per iteration (s): 15.21 | learning rate: 1.421E-05 | global batch size: 16 | lm loss: 6.154242E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2712/ 128728 | consumed samples: 43392 | consumed tokens: 88866816 | elapsed time per iteration (s): 15.20 | learning rate: 1.422E-05 | global batch size: 16 | lm loss: 6.021749E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2713/ 128728 | consumed samples: 43408 | consumed tokens: 88899584 | elapsed time per iteration (s): 15.19 | learning rate: 1.422E-05 | global batch size: 16 | lm loss: 6.182495E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2714/ 128728 | consumed samples: 43424 | consumed tokens: 88932352 | elapsed time per iteration (s): 15.21 | learning rate: 1.423E-05 | global batch size: 16 | lm loss: 
5.947534E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2715/ 128728 | consumed samples: 43440 | consumed tokens: 88965120 | elapsed time per iteration (s): 15.24 | learning rate: 1.423E-05 | global batch size: 16 | lm loss: 5.916839E+00 | grad norm: 0.629 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2716/ 128728 | consumed samples: 43456 | consumed tokens: 88997888 | elapsed time per iteration (s): 15.21 | learning rate: 1.424E-05 | global batch size: 16 | lm loss: 5.961128E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2717/ 128728 | consumed samples: 43472 | consumed tokens: 89030656 | elapsed time per iteration (s): 15.27 | learning rate: 1.424E-05 | global batch size: 16 | lm loss: 6.250119E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2718/ 128728 | consumed samples: 43488 | consumed tokens: 89063424 | elapsed time per iteration (s): 15.19 | learning rate: 1.425E-05 | global batch size: 16 | lm loss: 6.063711E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2719/ 128728 | consumed samples: 43504 | consumed tokens: 89096192 | elapsed time per iteration (s): 15.25 | learning rate: 1.426E-05 | global batch size: 16 | lm loss: 5.790985E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2720/ 128728 | consumed samples: 43520 | consumed tokens: 89128960 | 
elapsed time per iteration (s): 15.22 | learning rate: 1.426E-05 | global batch size: 16 | lm loss: 6.230259E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2721/ 128728 | consumed samples: 43536 | consumed tokens: 89161728 | elapsed time per iteration (s): 15.20 | learning rate: 1.427E-05 | global batch size: 16 | lm loss: 6.079679E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2722/ 128728 | consumed samples: 43552 | consumed tokens: 89194496 | elapsed time per iteration (s): 15.22 | learning rate: 1.427E-05 | global batch size: 16 | lm loss: 6.003428E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2723/ 128728 | consumed samples: 43568 | consumed tokens: 89227264 | elapsed time per iteration (s): 15.17 | learning rate: 1.428E-05 | global batch size: 16 | lm loss: 6.202793E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2724/ 128728 | consumed samples: 43584 | consumed tokens: 89260032 | elapsed time per iteration (s): 15.22 | learning rate: 1.428E-05 | global batch size: 16 | lm loss: 5.948997E+00 | grad norm: 0.636 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2725/ 128728 | consumed samples: 43600 | consumed tokens: 89292800 | elapsed time per iteration (s): 15.17 | learning rate: 1.429E-05 | global batch size: 16 | lm loss: 6.143008E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 
8.08 | [default7]: iteration 2726/ 128728 | consumed samples: 43616 | consumed tokens: 89325568 | elapsed time per iteration (s): 15.23 | learning rate: 1.429E-05 | global batch size: 16 | lm loss: 6.032366E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2727/ 128728 | consumed samples: 43632 | consumed tokens: 89358336 | elapsed time per iteration (s): 15.13 | learning rate: 1.430E-05 | global batch size: 16 | lm loss: 6.206609E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 2728/ 128728 | consumed samples: 43648 | consumed tokens: 89391104 | elapsed time per iteration (s): 15.15 | learning rate: 1.430E-05 | global batch size: 16 | lm loss: 5.929503E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2729/ 128728 | consumed samples: 43664 | consumed tokens: 89423872 | elapsed time per iteration (s): 15.19 | learning rate: 1.431E-05 | global batch size: 16 | lm loss: 6.076304E+00 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2730/ 128728 | consumed samples: 43680 | consumed tokens: 89456640 | elapsed time per iteration (s): 15.15 | learning rate: 1.431E-05 | global batch size: 16 | lm loss: 6.175723E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2731/ 128728 | consumed samples: 43696 | consumed tokens: 89489408 | elapsed time per iteration (s): 15.25 | learning rate: 1.432E-05 | global batch size: 16 | lm loss: 6.105374E+00 | grad norm: 0.698 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2732/ 128728 | consumed samples: 43712 | consumed tokens: 89522176 | elapsed time per iteration (s): 15.21 | learning rate: 1.432E-05 | global batch size: 16 | lm loss: 6.372894E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2733/ 128728 | consumed samples: 43728 | consumed tokens: 89554944 | elapsed time per iteration (s): 15.23 | learning rate: 1.433E-05 | global batch size: 16 | lm loss: 6.022964E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2734/ 128728 | consumed samples: 43744 | consumed tokens: 89587712 | elapsed time per iteration (s): 15.24 | learning rate: 1.433E-05 | global batch size: 16 | lm loss: 5.931406E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2735/ 128728 | consumed samples: 43760 | consumed tokens: 89620480 | elapsed time per iteration (s): 15.23 | learning rate: 1.434E-05 | global batch size: 16 | lm loss: 6.318775E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2736/ 128728 | consumed samples: 43776 | consumed tokens: 89653248 | elapsed time per iteration (s): 15.20 | learning rate: 1.434E-05 | global batch size: 16 | lm loss: 5.932520E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2737/ 128728 | consumed samples: 43792 | consumed tokens: 89686016 | elapsed time per iteration (s): 15.25 | learning rate: 
1.435E-05 | global batch size: 16 | lm loss: 5.937093E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2738/ 128728 | consumed samples: 43808 | consumed tokens: 89718784 | elapsed time per iteration (s): 15.20 | learning rate: 1.436E-05 | global batch size: 16 | lm loss: 6.135614E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2739/ 128728 | consumed samples: 43824 | consumed tokens: 89751552 | elapsed time per iteration (s): 15.24 | learning rate: 1.436E-05 | global batch size: 16 | lm loss: 6.076690E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2740/ 128728 | consumed samples: 43840 | consumed tokens: 89784320 | elapsed time per iteration (s): 15.25 | learning rate: 1.437E-05 | global batch size: 16 | lm loss: 5.929999E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2741/ 128728 | consumed samples: 43856 | consumed tokens: 89817088 | elapsed time per iteration (s): 15.20 | learning rate: 1.437E-05 | global batch size: 16 | lm loss: 6.232322E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2742/ 128728 | consumed samples: 43872 | consumed tokens: 89849856 | elapsed time per iteration (s): 15.21 | learning rate: 1.438E-05 | global batch size: 16 | lm loss: 6.364085E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2743/ 128728 | consumed 
samples: 43888 | consumed tokens: 89882624 | elapsed time per iteration (s): 15.26 | learning rate: 1.438E-05 | global batch size: 16 | lm loss: 5.733549E+00 | grad norm: 1.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2744/ 128728 | consumed samples: 43904 | consumed tokens: 89915392 | elapsed time per iteration (s): 15.21 | learning rate: 1.439E-05 | global batch size: 16 | lm loss: 5.822972E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2745/ 128728 | consumed samples: 43920 | consumed tokens: 89948160 | elapsed time per iteration (s): 15.24 | learning rate: 1.439E-05 | global batch size: 16 | lm loss: 6.177995E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2746/ 128728 | consumed samples: 43936 | consumed tokens: 89980928 | elapsed time per iteration (s): 15.15 | learning rate: 1.440E-05 | global batch size: 16 | lm loss: 6.296174E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 2747/ 128728 | consumed samples: 43952 | consumed tokens: 90013696 | elapsed time per iteration (s): 15.23 | learning rate: 1.440E-05 | global batch size: 16 | lm loss: 6.298337E+00 | grad norm: 1.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2748/ 128728 | consumed samples: 43968 | consumed tokens: 90046464 | elapsed time per iteration (s): 15.25 | learning rate: 1.441E-05 | global batch size: 16 | lm loss: 6.149353E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2749/ 128728 | consumed samples: 43984 | consumed tokens: 90079232 | elapsed time per iteration (s): 15.22 | learning rate: 1.441E-05 | global batch size: 16 | lm loss: 5.981178E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2750/ 128728 | consumed samples: 44000 | consumed tokens: 90112000 | elapsed time per iteration (s): 15.19 | learning rate: 1.442E-05 | global batch size: 16 | lm loss: 6.031982E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2751/ 128728 | consumed samples: 44016 | consumed tokens: 90144768 | elapsed time per iteration (s): 15.16 | learning rate: 1.442E-05 | global batch size: 16 | lm loss: 5.927257E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2752/ 128728 | consumed samples: 44032 | consumed tokens: 90177536 | elapsed time per iteration (s): 15.23 | learning rate: 1.443E-05 | global batch size: 16 | lm loss: 5.992155E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2753/ 128728 | consumed samples: 44048 | consumed tokens: 90210304 | elapsed time per iteration (s): 15.24 | learning rate: 1.443E-05 | global batch size: 16 | lm loss: 6.082148E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2754/ 128728 | consumed samples: 44064 | consumed tokens: 90243072 | elapsed time per iteration (s): 15.21 | learning rate: 1.444E-05 | global batch size: 16 | lm loss: 
5.980026E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2755/ 128728 | consumed samples: 44080 | consumed tokens: 90275840 | elapsed time per iteration (s): 15.21 | learning rate: 1.444E-05 | global batch size: 16 | lm loss: 6.085819E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2756/ 128728 | consumed samples: 44096 | consumed tokens: 90308608 | elapsed time per iteration (s): 15.22 | learning rate: 1.445E-05 | global batch size: 16 | lm loss: 6.038049E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2757/ 128728 | consumed samples: 44112 | consumed tokens: 90341376 | elapsed time per iteration (s): 15.23 | learning rate: 1.445E-05 | global batch size: 16 | lm loss: 5.992010E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2758/ 128728 | consumed samples: 44128 | consumed tokens: 90374144 | elapsed time per iteration (s): 15.21 | learning rate: 1.446E-05 | global batch size: 16 | lm loss: 5.833893E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2759/ 128728 | consumed samples: 44144 | consumed tokens: 90406912 | elapsed time per iteration (s): 15.23 | learning rate: 1.447E-05 | global batch size: 16 | lm loss: 6.127007E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2760/ 128728 | consumed samples: 44160 | consumed tokens: 90439680 | 
elapsed time per iteration (s): 15.21 | learning rate: 1.447E-05 | global batch size: 16 | lm loss: 6.115055E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2761/ 128728 | consumed samples: 44176 | consumed tokens: 90472448 | elapsed time per iteration (s): 15.24 | learning rate: 1.448E-05 | global batch size: 16 | lm loss: 6.209776E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2762/ 128728 | consumed samples: 44192 | consumed tokens: 90505216 | elapsed time per iteration (s): 15.20 | learning rate: 1.448E-05 | global batch size: 16 | lm loss: 6.111469E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2763/ 128728 | consumed samples: 44208 | consumed tokens: 90537984 | elapsed time per iteration (s): 15.25 | learning rate: 1.449E-05 | global batch size: 16 | lm loss: 6.213965E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2764/ 128728 | consumed samples: 44224 | consumed tokens: 90570752 | elapsed time per iteration (s): 15.25 | learning rate: 1.449E-05 | global batch size: 16 | lm loss: 5.969048E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2765/ 128728 | consumed samples: 44240 | consumed tokens: 90603520 | elapsed time per iteration (s): 15.24 | learning rate: 1.450E-05 | global batch size: 16 | lm loss: 6.218442E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 
8.04 | [default7]: iteration 2766/ 128728 | consumed samples: 44256 | consumed tokens: 90636288 | elapsed time per iteration (s): 15.21 | learning rate: 1.450E-05 | global batch size: 16 | lm loss: 6.126570E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2767/ 128728 | consumed samples: 44272 | consumed tokens: 90669056 | elapsed time per iteration (s): 15.16 | learning rate: 1.451E-05 | global batch size: 16 | lm loss: 6.056056E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2768/ 128728 | consumed samples: 44288 | consumed tokens: 90701824 | elapsed time per iteration (s): 15.24 | learning rate: 1.451E-05 | global batch size: 16 | lm loss: 6.007943E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2769/ 128728 | consumed samples: 44304 | consumed tokens: 90734592 | elapsed time per iteration (s): 15.19 | learning rate: 1.452E-05 | global batch size: 16 | lm loss: 5.851771E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2770/ 128728 | consumed samples: 44320 | consumed tokens: 90767360 | elapsed time per iteration (s): 15.24 | learning rate: 1.452E-05 | global batch size: 16 | lm loss: 6.106419E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2771/ 128728 | consumed samples: 44336 | consumed tokens: 90800128 | elapsed time per iteration (s): 15.22 | learning rate: 1.453E-05 | global batch size: 16 | lm loss: 5.806401E+00 | grad norm: 0.929 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2772/ 128728 | consumed samples: 44352 | consumed tokens: 90832896 | elapsed time per iteration (s): 15.24 | learning rate: 1.453E-05 | global batch size: 16 | lm loss: 6.068120E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2773/ 128728 | consumed samples: 44368 | consumed tokens: 90865664 | elapsed time per iteration (s): 15.20 | learning rate: 1.454E-05 | global batch size: 16 | lm loss: 5.843704E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2774/ 128728 | consumed samples: 44384 | consumed tokens: 90898432 | elapsed time per iteration (s): 15.19 | learning rate: 1.454E-05 | global batch size: 16 | lm loss: 6.001309E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2775/ 128728 | consumed samples: 44400 | consumed tokens: 90931200 | elapsed time per iteration (s): 15.18 | learning rate: 1.455E-05 | global batch size: 16 | lm loss: 6.218292E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2776/ 128728 | consumed samples: 44416 | consumed tokens: 90963968 | elapsed time per iteration (s): 15.21 | learning rate: 1.455E-05 | global batch size: 16 | lm loss: 6.178038E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2777/ 128728 | consumed samples: 44432 | consumed tokens: 90996736 | elapsed time per iteration (s): 15.15 | learning rate: 
1.456E-05 | global batch size: 16 | lm loss: 6.058540E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2778/ 128728 | consumed samples: 44448 | consumed tokens: 91029504 | elapsed time per iteration (s): 15.22 | learning rate: 1.456E-05 | global batch size: 16 | lm loss: 6.073587E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2779/ 128728 | consumed samples: 44464 | consumed tokens: 91062272 | elapsed time per iteration (s): 15.21 | learning rate: 1.457E-05 | global batch size: 16 | lm loss: 6.025464E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2780/ 128728 | consumed samples: 44480 | consumed tokens: 91095040 | elapsed time per iteration (s): 15.24 | learning rate: 1.458E-05 | global batch size: 16 | lm loss: 6.045417E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2781/ 128728 | consumed samples: 44496 | consumed tokens: 91127808 | elapsed time per iteration (s): 15.22 | learning rate: 1.458E-05 | global batch size: 16 | lm loss: 5.972544E+00 | grad norm: 0.634 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2782/ 128728 | consumed samples: 44512 | consumed tokens: 91160576 | elapsed time per iteration (s): 15.22 | learning rate: 1.459E-05 | global batch size: 16 | lm loss: 6.040277E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2783/ 128728 | consumed 
samples: 44528 | consumed tokens: 91193344 | elapsed time per iteration (s): 15.23 | learning rate: 1.459E-05 | global batch size: 16 | lm loss: 6.183329E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2784/ 128728 | consumed samples: 44544 | consumed tokens: 91226112 | elapsed time per iteration (s): 15.20 | learning rate: 1.460E-05 | global batch size: 16 | lm loss: 5.996538E+00 | grad norm: 0.598 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2785/ 128728 | consumed samples: 44560 | consumed tokens: 91258880 | elapsed time per iteration (s): 15.22 | learning rate: 1.460E-05 | global batch size: 16 | lm loss: 6.052549E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2786/ 128728 | consumed samples: 44576 | consumed tokens: 91291648 | elapsed time per iteration (s): 15.24 | learning rate: 1.461E-05 | global batch size: 16 | lm loss: 6.023203E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2787/ 128728 | consumed samples: 44592 | consumed tokens: 91324416 | elapsed time per iteration (s): 15.23 | learning rate: 1.461E-05 | global batch size: 16 | lm loss: 5.934374E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2788/ 128728 | consumed samples: 44608 | consumed tokens: 91357184 | elapsed time per iteration (s): 15.17 | learning rate: 1.462E-05 | global batch size: 16 | lm loss: 5.979886E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2789/ 128728 | consumed samples: 44624 | consumed tokens: 91389952 | elapsed time per iteration (s): 15.20 | learning rate: 1.462E-05 | global batch size: 16 | lm loss: 5.939224E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2790/ 128728 | consumed samples: 44640 | consumed tokens: 91422720 | elapsed time per iteration (s): 15.21 | learning rate: 1.463E-05 | global batch size: 16 | lm loss: 6.009098E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2791/ 128728 | consumed samples: 44656 | consumed tokens: 91455488 | elapsed time per iteration (s): 15.21 | learning rate: 1.463E-05 | global batch size: 16 | lm loss: 5.886978E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2792/ 128728 | consumed samples: 44672 | consumed tokens: 91488256 | elapsed time per iteration (s): 15.18 | learning rate: 1.464E-05 | global batch size: 16 | lm loss: 5.919722E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2793/ 128728 | consumed samples: 44688 | consumed tokens: 91521024 | elapsed time per iteration (s): 15.21 | learning rate: 1.464E-05 | global batch size: 16 | lm loss: 5.969708E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2794/ 128728 | consumed samples: 44704 | consumed tokens: 91553792 | elapsed time per iteration (s): 15.23 | learning rate: 1.465E-05 | global batch size: 16 | lm loss: 
6.022653E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2795/ 128728 | consumed samples: 44720 | consumed tokens: 91586560 | elapsed time per iteration (s): 15.19 | learning rate: 1.465E-05 | global batch size: 16 | lm loss: 6.179086E+00 | grad norm: 1.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2796/ 128728 | consumed samples: 44736 | consumed tokens: 91619328 | elapsed time per iteration (s): 15.19 | learning rate: 1.466E-05 | global batch size: 16 | lm loss: 5.982589E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2797/ 128728 | consumed samples: 44752 | consumed tokens: 91652096 | elapsed time per iteration (s): 15.21 | learning rate: 1.466E-05 | global batch size: 16 | lm loss: 6.000892E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2798/ 128728 | consumed samples: 44768 | consumed tokens: 91684864 | elapsed time per iteration (s): 15.22 | learning rate: 1.467E-05 | global batch size: 16 | lm loss: 6.116832E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2799/ 128728 | consumed samples: 44784 | consumed tokens: 91717632 | elapsed time per iteration (s): 15.23 | learning rate: 1.467E-05 | global batch size: 16 | lm loss: 6.036739E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2800/ 128728 | consumed samples: 44800 | consumed tokens: 91750400 | 
elapsed time per iteration (s): 15.29 | learning rate: 1.468E-05 | global batch size: 16 | lm loss: 6.083531E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 2801/ 128728 | consumed samples: 44816 | consumed tokens: 91783168 | elapsed time per iteration (s): 15.20 | learning rate: 1.469E-05 | global batch size: 16 | lm loss: 5.965879E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2802/ 128728 | consumed samples: 44832 | consumed tokens: 91815936 | elapsed time per iteration (s): 15.24 | learning rate: 1.469E-05 | global batch size: 16 | lm loss: 5.960813E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2803/ 128728 | consumed samples: 44848 | consumed tokens: 91848704 | elapsed time per iteration (s): 15.29 | learning rate: 1.470E-05 | global batch size: 16 | lm loss: 6.129034E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 2804/ 128728 | consumed samples: 44864 | consumed tokens: 91881472 | elapsed time per iteration (s): 15.23 | learning rate: 1.470E-05 | global batch size: 16 | lm loss: 6.187738E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2805/ 128728 | consumed samples: 44880 | consumed tokens: 91914240 | elapsed time per iteration (s): 15.19 | learning rate: 1.471E-05 | global batch size: 16 | lm loss: 5.718319E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 
8.07 | [default7]: iteration 2806/ 128728 | consumed samples: 44896 | consumed tokens: 91947008 | elapsed time per iteration (s): 15.25 | learning rate: 1.471E-05 | global batch size: 16 | lm loss: 5.992695E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 2807/ 128728 | consumed samples: 44912 | consumed tokens: 91979776 | elapsed time per iteration (s): 15.22 | learning rate: 1.472E-05 | global batch size: 16 | lm loss: 6.170292E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2808/ 128728 | consumed samples: 44928 | consumed tokens: 92012544 | elapsed time per iteration (s): 15.26 | learning rate: 1.472E-05 | global batch size: 16 | lm loss: 5.820615E+00 | grad norm: 1.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2809/ 128728 | consumed samples: 44944 | consumed tokens: 92045312 | elapsed time per iteration (s): 15.29 | learning rate: 1.473E-05 | global batch size: 16 | lm loss: 6.132642E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 | [default7]: iteration 2810/ 128728 | consumed samples: 44960 | consumed tokens: 92078080 | elapsed time per iteration (s): 15.23 | learning rate: 1.473E-05 | global batch size: 16 | lm loss: 5.860527E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2811/ 128728 | consumed samples: 44976 | consumed tokens: 92110848 | elapsed time per iteration (s): 15.22 | learning rate: 1.474E-05 | global batch size: 16 | lm loss: 6.165506E+00 | grad norm: 1.233 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2812/ 128728 | consumed samples: 44992 | consumed tokens: 92143616 | elapsed time per iteration (s): 15.21 | learning rate: 1.474E-05 | global batch size: 16 | lm loss: 6.085719E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2813/ 128728 | consumed samples: 45008 | consumed tokens: 92176384 | elapsed time per iteration (s): 15.24 | learning rate: 1.475E-05 | global batch size: 16 | lm loss: 6.115023E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2814/ 128728 | consumed samples: 45024 | consumed tokens: 92209152 | elapsed time per iteration (s): 15.21 | learning rate: 1.475E-05 | global batch size: 16 | lm loss: 5.843146E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2815/ 128728 | consumed samples: 45040 | consumed tokens: 92241920 | elapsed time per iteration (s): 15.20 | learning rate: 1.476E-05 | global batch size: 16 | lm loss: 5.976727E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2816/ 128728 | consumed samples: 45056 | consumed tokens: 92274688 | elapsed time per iteration (s): 15.26 | learning rate: 1.476E-05 | global batch size: 16 | lm loss: 6.070988E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2817/ 128728 | consumed samples: 45072 | consumed tokens: 92307456 | elapsed time per iteration (s): 15.21 | learning rate: 
1.477E-05 | global batch size: 16 | lm loss: 6.018933E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2818/ 128728 | consumed samples: 45088 | consumed tokens: 92340224 | elapsed time per iteration (s): 15.24 | learning rate: 1.477E-05 | global batch size: 16 | lm loss: 6.134639E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2819/ 128728 | consumed samples: 45104 | consumed tokens: 92372992 | elapsed time per iteration (s): 15.21 | learning rate: 1.478E-05 | global batch size: 16 | lm loss: 5.952758E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2820/ 128728 | consumed samples: 45120 | consumed tokens: 92405760 | elapsed time per iteration (s): 15.24 | learning rate: 1.478E-05 | global batch size: 16 | lm loss: 6.103268E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2821/ 128728 | consumed samples: 45136 | consumed tokens: 92438528 | elapsed time per iteration (s): 15.24 | learning rate: 1.479E-05 | global batch size: 16 | lm loss: 5.782512E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2822/ 128728 | consumed samples: 45152 | consumed tokens: 92471296 | elapsed time per iteration (s): 15.23 | learning rate: 1.480E-05 | global batch size: 16 | lm loss: 6.080799E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2823/ 128728 | consumed 
samples: 45168 | consumed tokens: 92504064 | elapsed time per iteration (s): 15.21 | learning rate: 1.480E-05 | global batch size: 16 | lm loss: 6.054215E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2824/ 128728 | consumed samples: 45184 | consumed tokens: 92536832 | elapsed time per iteration (s): 15.19 | learning rate: 1.481E-05 | global batch size: 16 | lm loss: 6.130510E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2825/ 128728 | consumed samples: 45200 | consumed tokens: 92569600 | elapsed time per iteration (s): 15.19 | learning rate: 1.481E-05 | global batch size: 16 | lm loss: 6.226121E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2826/ 128728 | consumed samples: 45216 | consumed tokens: 92602368 | elapsed time per iteration (s): 15.16 | learning rate: 1.482E-05 | global batch size: 16 | lm loss: 5.877883E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2827/ 128728 | consumed samples: 45232 | consumed tokens: 92635136 | elapsed time per iteration (s): 15.23 | learning rate: 1.482E-05 | global batch size: 16 | lm loss: 5.866010E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2828/ 128728 | consumed samples: 45248 | consumed tokens: 92667904 | elapsed time per iteration (s): 15.21 | learning rate: 1.483E-05 | global batch size: 16 | lm loss: 6.033381E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2829/ 128728 | consumed samples: 45264 | consumed tokens: 92700672 | elapsed time per iteration (s): 15.22 | learning rate: 1.483E-05 | global batch size: 16 | lm loss: 6.256545E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2830/ 128728 | consumed samples: 45280 | consumed tokens: 92733440 | elapsed time per iteration (s): 15.21 | learning rate: 1.484E-05 | global batch size: 16 | lm loss: 5.981166E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2831/ 128728 | consumed samples: 45296 | consumed tokens: 92766208 | elapsed time per iteration (s): 15.24 | learning rate: 1.484E-05 | global batch size: 16 | lm loss: 6.093549E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2832/ 128728 | consumed samples: 45312 | consumed tokens: 92798976 | elapsed time per iteration (s): 15.21 | learning rate: 1.485E-05 | global batch size: 16 | lm loss: 5.899080E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2833/ 128728 | consumed samples: 45328 | consumed tokens: 92831744 | elapsed time per iteration (s): 15.21 | learning rate: 1.485E-05 | global batch size: 16 | lm loss: 6.259049E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2834/ 128728 | consumed samples: 45344 | consumed tokens: 92864512 | elapsed time per iteration (s): 15.23 | learning rate: 1.486E-05 | global batch size: 16 | lm loss: 
5.930161E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2835/ 128728 | consumed samples: 45360 | consumed tokens: 92897280 | elapsed time per iteration (s): 15.20 | learning rate: 1.486E-05 | global batch size: 16 | lm loss: 6.179988E+00 | grad norm: 0.994 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2836/ 128728 | consumed samples: 45376 | consumed tokens: 92930048 | elapsed time per iteration (s): 15.24 | learning rate: 1.487E-05 | global batch size: 16 | lm loss: 5.902924E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2837/ 128728 | consumed samples: 45392 | consumed tokens: 92962816 | elapsed time per iteration (s): 15.23 | learning rate: 1.487E-05 | global batch size: 16 | lm loss: 5.806733E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2838/ 128728 | consumed samples: 45408 | consumed tokens: 92995584 | elapsed time per iteration (s): 15.22 | learning rate: 1.488E-05 | global batch size: 16 | lm loss: 5.926982E+00 | grad norm: 1.011 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2839/ 128728 | consumed samples: 45424 | consumed tokens: 93028352 | elapsed time per iteration (s): 15.24 | learning rate: 1.488E-05 | global batch size: 16 | lm loss: 5.809728E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2840/ 128728 | consumed samples: 45440 | consumed tokens: 93061120 | 
elapsed time per iteration (s): 15.24 | learning rate: 1.489E-05 | global batch size: 16 | lm loss: 5.952487E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2841/ 128728 | consumed samples: 45456 | consumed tokens: 93093888 | elapsed time per iteration (s): 15.23 | learning rate: 1.490E-05 | global batch size: 16 | lm loss: 6.089927E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2842/ 128728 | consumed samples: 45472 | consumed tokens: 93126656 | elapsed time per iteration (s): 15.22 | learning rate: 1.490E-05 | global batch size: 16 | lm loss: 5.907791E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2843/ 128728 | consumed samples: 45488 | consumed tokens: 93159424 | elapsed time per iteration (s): 15.26 | learning rate: 1.491E-05 | global batch size: 16 | lm loss: 5.926930E+00 | grad norm: 0.647 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2844/ 128728 | consumed samples: 45504 | consumed tokens: 93192192 | elapsed time per iteration (s): 15.23 | learning rate: 1.491E-05 | global batch size: 16 | lm loss: 5.910907E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2845/ 128728 | consumed samples: 45520 | consumed tokens: 93224960 | elapsed time per iteration (s): 15.16 | learning rate: 1.492E-05 | global batch size: 16 | lm loss: 6.070807E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 
8.08 | [default7]: iteration 2846/ 128728 | consumed samples: 45536 | consumed tokens: 93257728 | elapsed time per iteration (s): 15.22 | learning rate: 1.492E-05 | global batch size: 16 | lm loss: 5.915307E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2847/ 128728 | consumed samples: 45552 | consumed tokens: 93290496 | elapsed time per iteration (s): 15.22 | learning rate: 1.493E-05 | global batch size: 16 | lm loss: 6.010011E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2848/ 128728 | consumed samples: 45568 | consumed tokens: 93323264 | elapsed time per iteration (s): 15.23 | learning rate: 1.493E-05 | global batch size: 16 | lm loss: 5.922984E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2849/ 128728 | consumed samples: 45584 | consumed tokens: 93356032 | elapsed time per iteration (s): 15.19 | learning rate: 1.494E-05 | global batch size: 16 | lm loss: 6.111296E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2850/ 128728 | consumed samples: 45600 | consumed tokens: 93388800 | elapsed time per iteration (s): 15.22 | learning rate: 1.494E-05 | global batch size: 16 | lm loss: 6.022770E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2851/ 128728 | consumed samples: 45616 | consumed tokens: 93421568 | elapsed time per iteration (s): 15.22 | learning rate: 1.495E-05 | global batch size: 16 | lm loss: 5.970350E+00 | grad norm: 0.673 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2852/ 128728 | consumed samples: 45632 | consumed tokens: 93454336 | elapsed time per iteration (s): 15.20 | learning rate: 1.495E-05 | global batch size: 16 | lm loss: 6.093951E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2853/ 128728 | consumed samples: 45648 | consumed tokens: 93487104 | elapsed time per iteration (s): 15.24 | learning rate: 1.496E-05 | global batch size: 16 | lm loss: 5.879686E+00 | grad norm: 1.039 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2854/ 128728 | consumed samples: 45664 | consumed tokens: 93519872 | elapsed time per iteration (s): 15.19 | learning rate: 1.496E-05 | global batch size: 16 | lm loss: 5.586125E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 2855/ 128728 | consumed samples: 45680 | consumed tokens: 93552640 | elapsed time per iteration (s): 15.22 | learning rate: 1.497E-05 | global batch size: 16 | lm loss: 5.921970E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2856/ 128728 | consumed samples: 45696 | consumed tokens: 93585408 | elapsed time per iteration (s): 15.13 | learning rate: 1.497E-05 | global batch size: 16 | lm loss: 5.962622E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 2857/ 128728 | consumed samples: 45712 | consumed tokens: 93618176 | elapsed time per iteration (s): 15.22 | learning rate: 
1.498E-05 | global batch size: 16 | lm loss: 6.157983E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2858/ 128728 | consumed samples: 45728 | consumed tokens: 93650944 | elapsed time per iteration (s): 15.21 | learning rate: 1.498E-05 | global batch size: 16 | lm loss: 5.974092E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2859/ 128728 | consumed samples: 45744 | consumed tokens: 93683712 | elapsed time per iteration (s): 15.21 | learning rate: 1.499E-05 | global batch size: 16 | lm loss: 5.760711E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2860/ 128728 | consumed samples: 45760 | consumed tokens: 93716480 | elapsed time per iteration (s): 15.20 | learning rate: 1.499E-05 | global batch size: 16 | lm loss: 6.026981E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2861/ 128728 | consumed samples: 45776 | consumed tokens: 93749248 | elapsed time per iteration (s): 15.20 | learning rate: 1.500E-05 | global batch size: 16 | lm loss: 5.793530E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2862/ 128728 | consumed samples: 45792 | consumed tokens: 93782016 | elapsed time per iteration (s): 15.22 | learning rate: 1.501E-05 | global batch size: 16 | lm loss: 5.890173E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2863/ 128728 | consumed 
samples: 45808 | consumed tokens: 93814784 | elapsed time per iteration (s): 15.22 | learning rate: 1.501E-05 | global batch size: 16 | lm loss: 6.015519E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2864/ 128728 | consumed samples: 45824 | consumed tokens: 93847552 | elapsed time per iteration (s): 15.24 | learning rate: 1.502E-05 | global batch size: 16 | lm loss: 6.149529E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2865/ 128728 | consumed samples: 45840 | consumed tokens: 93880320 | elapsed time per iteration (s): 15.18 | learning rate: 1.502E-05 | global batch size: 16 | lm loss: 6.066201E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2866/ 128728 | consumed samples: 45856 | consumed tokens: 93913088 | elapsed time per iteration (s): 15.25 | learning rate: 1.503E-05 | global batch size: 16 | lm loss: 6.205139E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2867/ 128728 | consumed samples: 45872 | consumed tokens: 93945856 | elapsed time per iteration (s): 15.25 | learning rate: 1.503E-05 | global batch size: 16 | lm loss: 6.108381E+00 | grad norm: 0.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2868/ 128728 | consumed samples: 45888 | consumed tokens: 93978624 | elapsed time per iteration (s): 15.17 | learning rate: 1.504E-05 | global batch size: 16 | lm loss: 5.996854E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2869/ 128728 | consumed samples: 45904 | consumed tokens: 94011392 | elapsed time per iteration (s): 15.18 | learning rate: 1.504E-05 | global batch size: 16 | lm loss: 5.922822E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2870/ 128728 | consumed samples: 45920 | consumed tokens: 94044160 | elapsed time per iteration (s): 15.22 | learning rate: 1.505E-05 | global batch size: 16 | lm loss: 6.114247E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2871/ 128728 | consumed samples: 45936 | consumed tokens: 94076928 | elapsed time per iteration (s): 15.19 | learning rate: 1.505E-05 | global batch size: 16 | lm loss: 6.018162E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2872/ 128728 | consumed samples: 45952 | consumed tokens: 94109696 | elapsed time per iteration (s): 15.21 | learning rate: 1.506E-05 | global batch size: 16 | lm loss: 5.803544E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2873/ 128728 | consumed samples: 45968 | consumed tokens: 94142464 | elapsed time per iteration (s): 15.21 | learning rate: 1.506E-05 | global batch size: 16 | lm loss: 5.869973E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2874/ 128728 | consumed samples: 45984 | consumed tokens: 94175232 | elapsed time per iteration (s): 15.22 | learning rate: 1.507E-05 | global batch size: 16 | lm loss: 
6.040289E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2875/ 128728 | consumed samples: 46000 | consumed tokens: 94208000 | elapsed time per iteration (s): 15.20 | learning rate: 1.507E-05 | global batch size: 16 | lm loss: 5.794731E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2876/ 128728 | consumed samples: 46016 | consumed tokens: 94240768 | elapsed time per iteration (s): 15.19 | learning rate: 1.508E-05 | global batch size: 16 | lm loss: 6.144478E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2877/ 128728 | consumed samples: 46032 | consumed tokens: 94273536 | elapsed time per iteration (s): 15.23 | learning rate: 1.508E-05 | global batch size: 16 | lm loss: 5.903439E+00 | grad norm: 0.629 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2878/ 128728 | consumed samples: 46048 | consumed tokens: 94306304 | elapsed time per iteration (s): 15.23 | learning rate: 1.509E-05 | global batch size: 16 | lm loss: 5.949089E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2879/ 128728 | consumed samples: 46064 | consumed tokens: 94339072 | elapsed time per iteration (s): 15.18 | learning rate: 1.509E-05 | global batch size: 16 | lm loss: 5.951438E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2880/ 128728 | consumed samples: 46080 | consumed tokens: 94371840 | 
elapsed time per iteration (s): 15.18 | learning rate: 1.510E-05 | global batch size: 16 | lm loss: 5.964561E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2881/ 128728 | consumed samples: 46096 | consumed tokens: 94404608 | elapsed time per iteration (s): 15.22 | learning rate: 1.510E-05 | global batch size: 16 | lm loss: 5.830142E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2882/ 128728 | consumed samples: 46112 | consumed tokens: 94437376 | elapsed time per iteration (s): 15.22 | learning rate: 1.511E-05 | global batch size: 16 | lm loss: 6.060420E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2883/ 128728 | consumed samples: 46128 | consumed tokens: 94470144 | elapsed time per iteration (s): 15.17 | learning rate: 1.512E-05 | global batch size: 16 | lm loss: 5.988078E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2884/ 128728 | consumed samples: 46144 | consumed tokens: 94502912 | elapsed time per iteration (s): 15.24 | learning rate: 1.512E-05 | global batch size: 16 | lm loss: 6.100799E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2885/ 128728 | consumed samples: 46160 | consumed tokens: 94535680 | elapsed time per iteration (s): 15.16 | learning rate: 1.513E-05 | global batch size: 16 | lm loss: 6.090507E+00 | grad norm: 1.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 
8.08 | [default7]: iteration 2886/ 128728 | consumed samples: 46176 | consumed tokens: 94568448 | elapsed time per iteration (s): 15.22 | learning rate: 1.513E-05 | global batch size: 16 | lm loss: 5.873533E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2887/ 128728 | consumed samples: 46192 | consumed tokens: 94601216 | elapsed time per iteration (s): 15.22 | learning rate: 1.514E-05 | global batch size: 16 | lm loss: 5.988422E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2888/ 128728 | consumed samples: 46208 | consumed tokens: 94633984 | elapsed time per iteration (s): 15.26 | learning rate: 1.514E-05 | global batch size: 16 | lm loss: 5.750258E+00 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2889/ 128728 | consumed samples: 46224 | consumed tokens: 94666752 | elapsed time per iteration (s): 15.22 | learning rate: 1.515E-05 | global batch size: 16 | lm loss: 5.921540E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2890/ 128728 | consumed samples: 46240 | consumed tokens: 94699520 | elapsed time per iteration (s): 15.21 | learning rate: 1.515E-05 | global batch size: 16 | lm loss: 6.116239E+00 | grad norm: 0.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2891/ 128728 | consumed samples: 46256 | consumed tokens: 94732288 | elapsed time per iteration (s): 15.19 | learning rate: 1.516E-05 | global batch size: 16 | lm loss: 6.022903E+00 | grad norm: 0.675 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2892/ 128728 | consumed samples: 46272 | consumed tokens: 94765056 | elapsed time per iteration (s): 15.25 | learning rate: 1.516E-05 | global batch size: 16 | lm loss: 6.116355E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2893/ 128728 | consumed samples: 46288 | consumed tokens: 94797824 | elapsed time per iteration (s): 15.22 | learning rate: 1.517E-05 | global batch size: 16 | lm loss: 5.981586E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2894/ 128728 | consumed samples: 46304 | consumed tokens: 94830592 | elapsed time per iteration (s): 15.24 | learning rate: 1.517E-05 | global batch size: 16 | lm loss: 6.004777E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2895/ 128728 | consumed samples: 46320 | consumed tokens: 94863360 | elapsed time per iteration (s): 15.20 | learning rate: 1.518E-05 | global batch size: 16 | lm loss: 6.011148E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2896/ 128728 | consumed samples: 46336 | consumed tokens: 94896128 | elapsed time per iteration (s): 15.20 | learning rate: 1.518E-05 | global batch size: 16 | lm loss: 5.884268E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2897/ 128728 | consumed samples: 46352 | consumed tokens: 94928896 | elapsed time per iteration (s): 15.26 | learning rate: 
1.519E-05 | global batch size: 16 | lm loss: 5.814329E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2898/ 128728 | consumed samples: 46368 | consumed tokens: 94961664 | elapsed time per iteration (s): 15.23 | learning rate: 1.519E-05 | global batch size: 16 | lm loss: 6.280364E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2899/ 128728 | consumed samples: 46384 | consumed tokens: 94994432 | elapsed time per iteration (s): 15.22 | learning rate: 1.520E-05 | global batch size: 16 | lm loss: 5.785411E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2900/ 128728 | consumed samples: 46400 | consumed tokens: 95027200 | elapsed time per iteration (s): 15.20 | learning rate: 1.520E-05 | global batch size: 16 | lm loss: 6.041264E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2901/ 128728 | consumed samples: 46416 | consumed tokens: 95059968 | elapsed time per iteration (s): 15.22 | learning rate: 1.521E-05 | global batch size: 16 | lm loss: 5.860376E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2902/ 128728 | consumed samples: 46432 | consumed tokens: 95092736 | elapsed time per iteration (s): 15.26 | learning rate: 1.521E-05 | global batch size: 16 | lm loss: 5.820327E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2903/ 128728 | consumed 
samples: 46448 | consumed tokens: 95125504 | elapsed time per iteration (s): 15.21 | learning rate: 1.522E-05 | global batch size: 16 | lm loss: 5.791872E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2904/ 128728 | consumed samples: 46464 | consumed tokens: 95158272 | elapsed time per iteration (s): 15.22 | learning rate: 1.523E-05 | global batch size: 16 | lm loss: 5.807111E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2905/ 128728 | consumed samples: 46480 | consumed tokens: 95191040 | elapsed time per iteration (s): 15.18 | learning rate: 1.523E-05 | global batch size: 16 | lm loss: 5.866320E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2906/ 128728 | consumed samples: 46496 | consumed tokens: 95223808 | elapsed time per iteration (s): 15.25 | learning rate: 1.524E-05 | global batch size: 16 | lm loss: 6.055687E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2907/ 128728 | consumed samples: 46512 | consumed tokens: 95256576 | elapsed time per iteration (s): 15.22 | learning rate: 1.524E-05 | global batch size: 16 | lm loss: 5.993578E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2908/ 128728 | consumed samples: 46528 | consumed tokens: 95289344 | elapsed time per iteration (s): 15.20 | learning rate: 1.525E-05 | global batch size: 16 | lm loss: 6.036336E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2909/ 128728 | consumed samples: 46544 | consumed tokens: 95322112 | elapsed time per iteration (s): 15.23 | learning rate: 1.525E-05 | global batch size: 16 | lm loss: 5.817921E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2910/ 128728 | consumed samples: 46560 | consumed tokens: 95354880 | elapsed time per iteration (s): 15.24 | learning rate: 1.526E-05 | global batch size: 16 | lm loss: 6.041966E+00 | grad norm: 0.965 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2911/ 128728 | consumed samples: 46576 | consumed tokens: 95387648 | elapsed time per iteration (s): 15.24 | learning rate: 1.526E-05 | global batch size: 16 | lm loss: 5.893199E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2912/ 128728 | consumed samples: 46592 | consumed tokens: 95420416 | elapsed time per iteration (s): 15.22 | learning rate: 1.527E-05 | global batch size: 16 | lm loss: 5.920829E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2913/ 128728 | consumed samples: 46608 | consumed tokens: 95453184 | elapsed time per iteration (s): 15.21 | learning rate: 1.527E-05 | global batch size: 16 | lm loss: 6.020864E+00 | grad norm: 0.999 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2914/ 128728 | consumed samples: 46624 | consumed tokens: 95485952 | elapsed time per iteration (s): 15.23 | learning rate: 1.528E-05 | global batch size: 16 | lm loss: 
5.852686E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2915/ 128728 | consumed samples: 46640 | consumed tokens: 95518720 | elapsed time per iteration (s): 15.28 | learning rate: 1.528E-05 | global batch size: 16 | lm loss: 6.035823E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2916/ 128728 | consumed samples: 46656 | consumed tokens: 95551488 | elapsed time per iteration (s): 15.20 | learning rate: 1.529E-05 | global batch size: 16 | lm loss: 5.785281E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2917/ 128728 | consumed samples: 46672 | consumed tokens: 95584256 | elapsed time per iteration (s): 15.23 | learning rate: 1.529E-05 | global batch size: 16 | lm loss: 5.900357E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2918/ 128728 | consumed samples: 46688 | consumed tokens: 95617024 | elapsed time per iteration (s): 15.23 | learning rate: 1.530E-05 | global batch size: 16 | lm loss: 6.010538E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2919/ 128728 | consumed samples: 46704 | consumed tokens: 95649792 | elapsed time per iteration (s): 15.24 | learning rate: 1.530E-05 | global batch size: 16 | lm loss: 5.867478E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2920/ 128728 | consumed samples: 46720 | consumed tokens: 95682560 | 
elapsed time per iteration (s): 15.25 | learning rate: 1.531E-05 | global batch size: 16 | lm loss: 5.778384E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 2921/ 128728 | consumed samples: 46736 | consumed tokens: 95715328 | elapsed time per iteration (s): 15.20 | learning rate: 1.531E-05 | global batch size: 16 | lm loss: 5.962376E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2922/ 128728 | consumed samples: 46752 | consumed tokens: 95748096 | elapsed time per iteration (s): 15.21 | learning rate: 1.532E-05 | global batch size: 16 | lm loss: 5.962127E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2923/ 128728 | consumed samples: 46768 | consumed tokens: 95780864 | elapsed time per iteration (s): 15.14 | learning rate: 1.532E-05 | global batch size: 16 | lm loss: 5.935369E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 2924/ 128728 | consumed samples: 46784 | consumed tokens: 95813632 | elapsed time per iteration (s): 15.23 | learning rate: 1.533E-05 | global batch size: 16 | lm loss: 5.939228E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2925/ 128728 | consumed samples: 46800 | consumed tokens: 95846400 | elapsed time per iteration (s): 15.19 | learning rate: 1.534E-05 | global batch size: 16 | lm loss: 5.899131E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 
8.07 | [default7]: iteration 2926/ 128728 | consumed samples: 46816 | consumed tokens: 95879168 | elapsed time per iteration (s): 15.20 | learning rate: 1.534E-05 | global batch size: 16 | lm loss: 5.991677E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2927/ 128728 | consumed samples: 46832 | consumed tokens: 95911936 | elapsed time per iteration (s): 15.23 | learning rate: 1.535E-05 | global batch size: 16 | lm loss: 6.101864E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2928/ 128728 | consumed samples: 46848 | consumed tokens: 95944704 | elapsed time per iteration (s): 15.23 | learning rate: 1.535E-05 | global batch size: 16 | lm loss: 5.901472E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 2929/ 128728 | consumed samples: 46864 | consumed tokens: 95977472 | elapsed time per iteration (s): 15.22 | learning rate: 1.536E-05 | global batch size: 16 | lm loss: 6.057093E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2930/ 128728 | consumed samples: 46880 | consumed tokens: 96010240 | elapsed time per iteration (s): 15.22 | learning rate: 1.536E-05 | global batch size: 16 | lm loss: 5.913117E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2931/ 128728 | consumed samples: 46896 | consumed tokens: 96043008 | elapsed time per iteration (s): 15.20 | learning rate: 1.537E-05 | global batch size: 16 | lm loss: 5.945035E+00 | grad norm: 0.701 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2932/ 128728 | consumed samples: 46912 | consumed tokens: 96075776 | elapsed time per iteration (s): 15.21 | learning rate: 1.537E-05 | global batch size: 16 | lm loss: 5.830423E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2933/ 128728 | consumed samples: 46928 | consumed tokens: 96108544 | elapsed time per iteration (s): 15.22 | learning rate: 1.538E-05 | global batch size: 16 | lm loss: 6.088906E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2934/ 128728 | consumed samples: 46944 | consumed tokens: 96141312 | elapsed time per iteration (s): 15.28 | learning rate: 1.538E-05 | global batch size: 16 | lm loss: 5.862062E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 2935/ 128728 | consumed samples: 46960 | consumed tokens: 96174080 | elapsed time per iteration (s): 15.22 | learning rate: 1.539E-05 | global batch size: 16 | lm loss: 5.764572E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2936/ 128728 | consumed samples: 46976 | consumed tokens: 96206848 | elapsed time per iteration (s): 15.23 | learning rate: 1.539E-05 | global batch size: 16 | lm loss: 5.989824E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2937/ 128728 | consumed samples: 46992 | consumed tokens: 96239616 | elapsed time per iteration (s): 15.27 | learning rate: 
1.540E-05 | global batch size: 16 | lm loss: 5.880247E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 2938/ 128728 | consumed samples: 47008 | consumed tokens: 96272384 | elapsed time per iteration (s): 15.19 | learning rate: 1.540E-05 | global batch size: 16 | lm loss: 5.923770E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2939/ 128728 | consumed samples: 47024 | consumed tokens: 96305152 | elapsed time per iteration (s): 15.22 | learning rate: 1.541E-05 | global batch size: 16 | lm loss: 5.879602E+00 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2940/ 128728 | consumed samples: 47040 | consumed tokens: 96337920 | elapsed time per iteration (s): 15.23 | learning rate: 1.541E-05 | global batch size: 16 | lm loss: 5.848747E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2941/ 128728 | consumed samples: 47056 | consumed tokens: 96370688 | elapsed time per iteration (s): 15.22 | learning rate: 1.542E-05 | global batch size: 16 | lm loss: 5.908345E+00 | grad norm: 0.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2942/ 128728 | consumed samples: 47072 | consumed tokens: 96403456 | elapsed time per iteration (s): 15.25 | learning rate: 1.542E-05 | global batch size: 16 | lm loss: 5.707866E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2943/ 128728 | consumed 
samples: 47088 | consumed tokens: 96436224 | elapsed time per iteration (s): 15.22 | learning rate: 1.543E-05 | global batch size: 16 | lm loss: 6.033948E+00 | grad norm: 0.640 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2944/ 128728 | consumed samples: 47104 | consumed tokens: 96468992 | elapsed time per iteration (s): 15.20 | learning rate: 1.544E-05 | global batch size: 16 | lm loss: 5.967467E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2945/ 128728 | consumed samples: 47120 | consumed tokens: 96501760 | elapsed time per iteration (s): 15.20 | learning rate: 1.544E-05 | global batch size: 16 | lm loss: 5.921725E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2946/ 128728 | consumed samples: 47136 | consumed tokens: 96534528 | elapsed time per iteration (s): 15.22 | learning rate: 1.545E-05 | global batch size: 16 | lm loss: 5.984942E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2947/ 128728 | consumed samples: 47152 | consumed tokens: 96567296 | elapsed time per iteration (s): 15.22 | learning rate: 1.545E-05 | global batch size: 16 | lm loss: 5.708416E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2948/ 128728 | consumed samples: 47168 | consumed tokens: 96600064 | elapsed time per iteration (s): 15.26 | learning rate: 1.546E-05 | global batch size: 16 | lm loss: 5.940567E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2949/ 128728 | consumed samples: 47184 | consumed tokens: 96632832 | elapsed time per iteration (s): 15.21 | learning rate: 1.546E-05 | global batch size: 16 | lm loss: 5.731608E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2950/ 128728 | consumed samples: 47200 | consumed tokens: 96665600 | elapsed time per iteration (s): 15.19 | learning rate: 1.547E-05 | global batch size: 16 | lm loss: 5.956516E+00 | grad norm: 0.964 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2951/ 128728 | consumed samples: 47216 | consumed tokens: 96698368 | elapsed time per iteration (s): 15.26 | learning rate: 1.547E-05 | global batch size: 16 | lm loss: 6.100035E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2952/ 128728 | consumed samples: 47232 | consumed tokens: 96731136 | elapsed time per iteration (s): 15.23 | learning rate: 1.548E-05 | global batch size: 16 | lm loss: 5.803092E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2953/ 128728 | consumed samples: 47248 | consumed tokens: 96763904 | elapsed time per iteration (s): 15.23 | learning rate: 1.548E-05 | global batch size: 16 | lm loss: 5.983268E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2954/ 128728 | consumed samples: 47264 | consumed tokens: 96796672 | elapsed time per iteration (s): 15.20 | learning rate: 1.549E-05 | global batch size: 16 | lm loss: 
5.938457E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2955/ 128728 | consumed samples: 47280 | consumed tokens: 96829440 | elapsed time per iteration (s): 15.25 | learning rate: 1.549E-05 | global batch size: 16 | lm loss: 5.933385E+00 | grad norm: 1.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2956/ 128728 | consumed samples: 47296 | consumed tokens: 96862208 | elapsed time per iteration (s): 15.26 | learning rate: 1.550E-05 | global batch size: 16 | lm loss: 5.850451E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2957/ 128728 | consumed samples: 47312 | consumed tokens: 96894976 | elapsed time per iteration (s): 15.20 | learning rate: 1.550E-05 | global batch size: 16 | lm loss: 5.800276E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2958/ 128728 | consumed samples: 47328 | consumed tokens: 96927744 | elapsed time per iteration (s): 15.17 | learning rate: 1.551E-05 | global batch size: 16 | lm loss: 6.125942E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2959/ 128728 | consumed samples: 47344 | consumed tokens: 96960512 | elapsed time per iteration (s): 15.22 | learning rate: 1.551E-05 | global batch size: 16 | lm loss: 5.967272E+00 | grad norm: 1.568 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2960/ 128728 | consumed samples: 47360 | consumed tokens: 96993280 | 
elapsed time per iteration (s): 15.23 | learning rate: 1.552E-05 | global batch size: 16 | lm loss: 6.135997E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2961/ 128728 | consumed samples: 47376 | consumed tokens: 97026048 | elapsed time per iteration (s): 15.17 | learning rate: 1.552E-05 | global batch size: 16 | lm loss: 6.001085E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 2962/ 128728 | consumed samples: 47392 | consumed tokens: 97058816 | elapsed time per iteration (s): 15.25 | learning rate: 1.553E-05 | global batch size: 16 | lm loss: 6.062928E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 2963/ 128728 | consumed samples: 47408 | consumed tokens: 97091584 | elapsed time per iteration (s): 15.23 | learning rate: 1.553E-05 | global batch size: 16 | lm loss: 6.055041E+00 | grad norm: 1.352 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2964/ 128728 | consumed samples: 47424 | consumed tokens: 97124352 | elapsed time per iteration (s): 15.23 | learning rate: 1.554E-05 | global batch size: 16 | lm loss: 5.878264E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2965/ 128728 | consumed samples: 47440 | consumed tokens: 97157120 | elapsed time per iteration (s): 15.19 | learning rate: 1.555E-05 | global batch size: 16 | lm loss: 6.206885E+00 | grad norm: 1.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 
8.06 | [default7]: iteration 2966/ 128728 | consumed samples: 47456 | consumed tokens: 97189888 | elapsed time per iteration (s): 15.21 | learning rate: 1.555E-05 | global batch size: 16 | lm loss: 6.068411E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2967/ 128728 | consumed samples: 47472 | consumed tokens: 97222656 | elapsed time per iteration (s): 15.20 | learning rate: 1.556E-05 | global batch size: 16 | lm loss: 5.927691E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2968/ 128728 | consumed samples: 47488 | consumed tokens: 97255424 | elapsed time per iteration (s): 15.26 | learning rate: 1.556E-05 | global batch size: 16 | lm loss: 6.127417E+00 | grad norm: 1.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2969/ 128728 | consumed samples: 47504 | consumed tokens: 97288192 | elapsed time per iteration (s): 15.24 | learning rate: 1.557E-05 | global batch size: 16 | lm loss: 6.099837E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2970/ 128728 | consumed samples: 47520 | consumed tokens: 97320960 | elapsed time per iteration (s): 15.19 | learning rate: 1.557E-05 | global batch size: 16 | lm loss: 5.764379E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2971/ 128728 | consumed samples: 47536 | consumed tokens: 97353728 | elapsed time per iteration (s): 15.22 | learning rate: 1.558E-05 | global batch size: 16 | lm loss: 5.941983E+00 | grad norm: 0.698 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2972/ 128728 | consumed samples: 47552 | consumed tokens: 97386496 | elapsed time per iteration (s): 15.23 | learning rate: 1.558E-05 | global batch size: 16 | lm loss: 5.736744E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2973/ 128728 | consumed samples: 47568 | consumed tokens: 97419264 | elapsed time per iteration (s): 15.24 | learning rate: 1.559E-05 | global batch size: 16 | lm loss: 5.593853E+00 | grad norm: 0.634 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2974/ 128728 | consumed samples: 47584 | consumed tokens: 97452032 | elapsed time per iteration (s): 15.24 | learning rate: 1.559E-05 | global batch size: 16 | lm loss: 5.908027E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2975/ 128728 | consumed samples: 47600 | consumed tokens: 97484800 | elapsed time per iteration (s): 15.17 | learning rate: 1.560E-05 | global batch size: 16 | lm loss: 5.938254E+00 | grad norm: 0.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2976/ 128728 | consumed samples: 47616 | consumed tokens: 97517568 | elapsed time per iteration (s): 15.22 | learning rate: 1.560E-05 | global batch size: 16 | lm loss: 5.775309E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2977/ 128728 | consumed samples: 47632 | consumed tokens: 97550336 | elapsed time per iteration (s): 15.23 | learning rate: 
1.561E-05 | global batch size: 16 | lm loss: 6.102681E+00 | grad norm: 1.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2978/ 128728 | consumed samples: 47648 | consumed tokens: 97583104 | elapsed time per iteration (s): 15.26 | learning rate: 1.561E-05 | global batch size: 16 | lm loss: 5.797580E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 2979/ 128728 | consumed samples: 47664 | consumed tokens: 97615872 | elapsed time per iteration (s): 15.21 | learning rate: 1.562E-05 | global batch size: 16 | lm loss: 5.752298E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2980/ 128728 | consumed samples: 47680 | consumed tokens: 97648640 | elapsed time per iteration (s): 15.22 | learning rate: 1.562E-05 | global batch size: 16 | lm loss: 6.039430E+00 | grad norm: 1.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2981/ 128728 | consumed samples: 47696 | consumed tokens: 97681408 | elapsed time per iteration (s): 15.22 | learning rate: 1.563E-05 | global batch size: 16 | lm loss: 6.008101E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2982/ 128728 | consumed samples: 47712 | consumed tokens: 97714176 | elapsed time per iteration (s): 15.18 | learning rate: 1.563E-05 | global batch size: 16 | lm loss: 5.872960E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 2983/ 128728 | consumed 
samples: 47728 | consumed tokens: 97746944 | elapsed time per iteration (s): 15.19 | learning rate: 1.564E-05 | global batch size: 16 | lm loss: 6.110078E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2984/ 128728 | consumed samples: 47744 | consumed tokens: 97779712 | elapsed time per iteration (s): 15.24 | learning rate: 1.564E-05 | global batch size: 16 | lm loss: 6.011197E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2985/ 128728 | consumed samples: 47760 | consumed tokens: 97812480 | elapsed time per iteration (s): 15.16 | learning rate: 1.565E-05 | global batch size: 16 | lm loss: 5.898206E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2986/ 128728 | consumed samples: 47776 | consumed tokens: 97845248 | elapsed time per iteration (s): 15.22 | learning rate: 1.566E-05 | global batch size: 16 | lm loss: 5.987176E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 2987/ 128728 | consumed samples: 47792 | consumed tokens: 97878016 | elapsed time per iteration (s): 15.16 | learning rate: 1.566E-05 | global batch size: 16 | lm loss: 5.976408E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 2988/ 128728 | consumed samples: 47808 | consumed tokens: 97910784 | elapsed time per iteration (s): 15.22 | learning rate: 1.567E-05 | global batch size: 16 | lm loss: 5.972953E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2989/ 128728 | consumed samples: 47824 | consumed tokens: 97943552 | elapsed time per iteration (s): 15.25 | learning rate: 1.567E-05 | global batch size: 16 | lm loss: 6.006942E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 2990/ 128728 | consumed samples: 47840 | consumed tokens: 97976320 | elapsed time per iteration (s): 15.19 | learning rate: 1.568E-05 | global batch size: 16 | lm loss: 5.912127E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2991/ 128728 | consumed samples: 47856 | consumed tokens: 98009088 | elapsed time per iteration (s): 15.20 | learning rate: 1.568E-05 | global batch size: 16 | lm loss: 5.904402E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2992/ 128728 | consumed samples: 47872 | consumed tokens: 98041856 | elapsed time per iteration (s): 15.23 | learning rate: 1.569E-05 | global batch size: 16 | lm loss: 5.815178E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2993/ 128728 | consumed samples: 47888 | consumed tokens: 98074624 | elapsed time per iteration (s): 15.16 | learning rate: 1.569E-05 | global batch size: 16 | lm loss: 5.658585E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 2994/ 128728 | consumed samples: 47904 | consumed tokens: 98107392 | elapsed time per iteration (s): 15.21 | learning rate: 1.570E-05 | global batch size: 16 | lm loss: 
5.849427E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 2995/ 128728 | consumed samples: 47920 | consumed tokens: 98140160 | elapsed time per iteration (s): 15.23 | learning rate: 1.570E-05 | global batch size: 16 | lm loss: 5.904146E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2996/ 128728 | consumed samples: 47936 | consumed tokens: 98172928 | elapsed time per iteration (s): 15.20 | learning rate: 1.571E-05 | global batch size: 16 | lm loss: 5.926609E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 2997/ 128728 | consumed samples: 47952 | consumed tokens: 98205696 | elapsed time per iteration (s): 15.22 | learning rate: 1.571E-05 | global batch size: 16 | lm loss: 6.086730E+00 | grad norm: 1.033 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 2998/ 128728 | consumed samples: 47968 | consumed tokens: 98238464 | elapsed time per iteration (s): 15.23 | learning rate: 1.572E-05 | global batch size: 16 | lm loss: 5.667955E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 2999/ 128728 | consumed samples: 47984 | consumed tokens: 98271232 | elapsed time per iteration (s): 15.21 | learning rate: 1.572E-05 | global batch size: 16 | lm loss: 5.905001E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3000/ 128728 | consumed samples: 48000 | consumed tokens: 98304000 | 
elapsed time per iteration (s): 15.21 | learning rate: 1.573E-05 | global batch size: 16 | lm loss: 6.000812E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default0]:saving checkpoint at iteration 3000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]:------------------------------------------------------------------------------------------ [default7]:valid loss at iteration 3000 | lm loss value: 6.276583E+00 | lm loss PPL: 5.319677E+02 | [default7]:------------------------------------------------------------------------------------------ [default1]:[2022-03-03 18:40:11,519] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/mp_rank_01_model_states.pt [default0]:[2022-03-03 18:40:11,489] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/mp_rank_00_model_states.pt [default1]:[2022-03-03 18:40:25,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default5]:[2022-03-03 18:40:25,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default4]:[2022-03-03 18:40:25,961] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default0]:[2022-03-03 18:40:26,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default3]:[2022-03-03 18:40:26,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default0]:[2022-03-03 18:40:26,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default6]:[2022-03-03 18:40:26,210] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default7]:[2022-03-03 18:40:26,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default2]:[2022-03-03 18:40:26,445] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default6]:[2022-03-03 18:40:26,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default1]:[2022-03-03 18:40:26,799] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default4]:[2022-03-03 18:40:26,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default7]:[2022-03-03 18:40:26,868] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default5]:[2022-03-03 18:40:27,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default2]:[2022-03-03 18:40:27,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default2]:[2022-03-03 18:40:27,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default3]:[2022-03-03 18:40:27,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default7]:[2022-03-03 18:40:27,480] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default0]:[2022-03-03 18:40:27,665] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default6]:[2022-03-03 18:40:27,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default3]:[2022-03-03 18:40:27,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default5]:[2022-03-03 18:40:27,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default4]:[2022-03-03 18:40:27,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default1]:[2022-03-03 18:40:27,933] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default7]:[2022-03-03 18:40:28,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default4]:[2022-03-03 18:40:28,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default4]:[2022-03-03 18:40:28,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default2]:[2022-03-03 18:40:28,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default4]:[2022-03-03 18:40:28,343] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default3]:[2022-03-03 18:40:28,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default5]:[2022-03-03 18:40:28,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default7]:[2022-03-03 18:40:28,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default5]:[2022-03-03 18:40:28,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default6]:[2022-03-03 18:40:28,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default1]:[2022-03-03 18:40:28,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default2]:[2022-03-03 18:40:28,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default4]:[2022-03-03 18:40:28,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default2]:[2022-03-03 18:40:28,880] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default7]:[2022-03-03 18:40:28,874] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default5]:[2022-03-03 18:40:28,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default6]:[2022-03-03 18:40:28,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default3]:[2022-03-03 18:40:28,957] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default3]:[2022-03-03 18:40:29,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default0]:[2022-03-03 18:40:29,218] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default1]:[2022-03-03 18:40:29,238] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default7]:[2022-03-03 18:40:29,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default5]:[2022-03-03 18:40:29,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default6]:[2022-03-03 18:40:29,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default1]:[2022-03-03 18:40:29,350] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default1]:[2022-03-03 18:40:29,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default4]:[2022-03-03 18:40:29,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default0]:[2022-03-03 18:40:29,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default2]:[2022-03-03 18:40:29,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default0]:[2022-03-03 18:40:29,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default7]:[2022-03-03 18:40:29,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default3]:[2022-03-03 18:40:29,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default3]:[2022-03-03 18:40:29,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default0]:[2022-03-03 18:40:29,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default2]:[2022-03-03 18:40:29,738] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default1]:[2022-03-03 18:40:29,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default0]:[2022-03-03 18:40:29,877] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default7]:[2022-03-03 18:40:29,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default6]:[2022-03-03 18:40:29,798] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default4]:[2022-03-03 18:40:29,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default5]:[2022-03-03 18:40:29,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default2]:[2022-03-03 18:40:29,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default0]:[2022-03-03 18:40:29,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default1]:[2022-03-03 18:40:29,950] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default4]:[2022-03-03 18:40:29,968] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default3]:[2022-03-03 18:40:30,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default0]:[2022-03-03 18:40:29,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default3]:[2022-03-03 18:40:29,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default2]:[2022-03-03 18:40:30,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default1]:[2022-03-03 18:40:29,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default5]:[2022-03-03 18:40:30,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default6]:[2022-03-03 18:40:30,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default3]:[2022-03-03 18:40:30,224] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default3]:[2022-03-03 18:40:30,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default5]:[2022-03-03 18:40:30,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default6]:[2022-03-03 18:40:30,293] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default7]:[2022-03-03 18:40:30,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default5]:[2022-03-03 18:40:30,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default2]:[2022-03-03 18:40:30,403] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default2]:[2022-03-03 18:40:30,326] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default0]:[2022-03-03 18:40:30,398] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default2]:[2022-03-03 18:40:30,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default4]:[2022-03-03 18:40:30,430] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default0]:[2022-03-03 18:40:30,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default6]:[2022-03-03 18:40:30,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default4]:[2022-03-03 18:40:30,582] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default0]:[2022-03-03 18:40:30,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default1]:[2022-03-03 18:40:30,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default5]:[2022-03-03 18:40:30,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default5]:[2022-03-03 18:40:30,713] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default6]:[2022-03-03 18:40:30,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default1]:[2022-03-03 18:40:30,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default7]:[2022-03-03 18:40:30,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default0]:[2022-03-03 18:40:30,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default0]:[2022-03-03 18:40:30,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default7]:[2022-03-03 18:40:30,840] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default1]:[2022-03-03 18:40:30,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default6]:[2022-03-03 18:40:30,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default1]:[2022-03-03 18:40:30,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default4]:[2022-03-03 18:40:30,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default4]:[2022-03-03 18:40:30,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default6]:[2022-03-03 18:40:30,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default1]:[2022-03-03 18:40:31,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default3]:[2022-03-03 18:40:31,011] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default7]:[2022-03-03 18:40:30,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default2]:[2022-03-03 18:40:31,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default5]:[2022-03-03 18:40:30,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default4]:[2022-03-03 18:40:31,072] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default3]:[2022-03-03 18:40:30,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default3]:[2022-03-03 18:40:31,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default2]:[2022-03-03 18:40:31,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default1]:[2022-03-03 18:40:31,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default0]:[2022-03-03 18:40:31,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default4]:[2022-03-03 18:40:31,086] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default5]:[2022-03-03 18:40:31,054] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default7]:[2022-03-03 18:40:31,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default5]:[2022-03-03 18:40:31,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default0]:[2022-03-03 18:40:31,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default4]:[2022-03-03 18:40:31,092] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default1]:[2022-03-03 18:40:31,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default5]:[2022-03-03 18:40:31,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default7]:[2022-03-03 18:40:31,333] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default4]:[2022-03-03 18:40:31,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default6]:[2022-03-03 18:40:31,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default6]:[2022-03-03 18:40:31,452] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default5]:[2022-03-03 18:40:31,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default1]:[2022-03-03 18:40:31,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default0]:[2022-03-03 18:40:31,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default5]:[2022-03-03 18:40:31,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default3]:[2022-03-03 18:40:31,438] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default1]:[2022-03-03 18:40:31,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default4]:[2022-03-03 18:40:31,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default6]:[2022-03-03 18:40:31,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default2]:[2022-03-03 18:40:31,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default3]:[2022-03-03 18:40:31,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default2]:[2022-03-03 18:40:31,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default2]:[2022-03-03 18:40:31,507] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default3]:[2022-03-03 18:40:31,607] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default5]:[2022-03-03 18:40:31,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default4]:[2022-03-03 18:40:31,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default2]:[2022-03-03 18:40:31,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default2]:[2022-03-03 18:40:31,721] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default2]:[2022-03-03 18:40:31,825] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default3]:[2022-03-03 18:40:31,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default0]:[2022-03-03 18:40:31,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default2]:[2022-03-03 18:40:31,926] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default7]:[2022-03-03 18:40:31,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default4]:[2022-03-03 18:40:32,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default0]:[2022-03-03 18:40:32,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default3]:[2022-03-03 18:40:32,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default5]:[2022-03-03 18:40:32,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default1]:[2022-03-03 18:40:32,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default4]:[2022-03-03 18:40:32,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default3]:[2022-03-03 18:40:32,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default4]:[2022-03-03 18:40:32,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default7]:[2022-03-03 18:40:32,282] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default7]:[2022-03-03 18:40:32,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default1]:[2022-03-03 18:40:32,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default0]:[2022-03-03 18:40:32,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default1]:[2022-03-03 18:40:32,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default6]:[2022-03-03 18:40:32,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default6]:[2022-03-03 18:40:32,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default0]:[2022-03-03 18:40:32,366] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default5]:[2022-03-03 18:40:32,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default1]:[2022-03-03 18:40:32,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default0]:[2022-03-03 18:40:32,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default5]:[2022-03-03 18:40:32,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default3]:[2022-03-03 18:40:32,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default4]:[2022-03-03 18:40:32,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default2]:[2022-03-03 18:40:32,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default6]:[2022-03-03 18:40:32,798] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default0]:[2022-03-03 18:40:32,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default5]:[2022-03-03 18:40:32,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default7]:[2022-03-03 18:40:33,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default6]:[2022-03-03 18:40:33,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default5]:[2022-03-03 18:40:33,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default3]:[2022-03-03 18:40:32,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default3]:[2022-03-03 18:40:33,013] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default1]:[2022-03-03 18:40:33,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default1]:[2022-03-03 18:40:32,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default4]:[2022-03-03 18:40:33,097] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default3]:[2022-03-03 18:40:33,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default2]:[2022-03-03 18:40:33,146] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default3]:[2022-03-03 18:40:33,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default6]:[2022-03-03 18:40:33,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default2]:[2022-03-03 18:40:33,241] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default4]:[2022-03-03 18:40:33,266] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default4]:[2022-03-03 18:40:33,217] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default5]:[2022-03-03 18:40:33,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default1]:[2022-03-03 18:40:33,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default0]:[2022-03-03 18:40:33,499] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default6]:[2022-03-03 18:40:33,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default4]:[2022-03-03 18:40:33,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default1]:[2022-03-03 18:40:33,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default6]:[2022-03-03 18:40:33,612] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default1]:[2022-03-03 18:40:33,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default0]:[2022-03-03 18:40:33,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default1]:[2022-03-03 18:40:33,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default0]:[2022-03-03 18:40:33,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default2]:[2022-03-03 18:40:33,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default0]:[2022-03-03 18:40:33,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default0]:[2022-03-03 18:40:33,780] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default5]:[2022-03-03 18:40:33,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default0]:[2022-03-03 18:40:33,759] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default1]:[2022-03-03 18:40:33,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default1]:[2022-03-03 18:40:33,706] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default1]:[2022-03-03 18:40:33,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default7]:[2022-03-03 18:40:33,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default7]:[2022-03-03 18:40:33,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default6]:[2022-03-03 18:40:33,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default7]:[2022-03-03 18:40:33,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default7]:[2022-03-03 18:40:33,894] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default3]:[2022-03-03 18:40:33,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default2]:[2022-03-03 18:40:33,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default4]:[2022-03-03 18:40:33,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default6]:[2022-03-03 18:40:33,996] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default7]:[2022-03-03 18:40:34,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default2]:[2022-03-03 18:40:34,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default2]:[2022-03-03 18:40:34,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default6]:[2022-03-03 18:40:34,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default0]:[2022-03-03 18:40:34,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default6]:[2022-03-03 18:40:34,152] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default6]:[2022-03-03 18:40:34,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default1]:[2022-03-03 18:40:34,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default3]:[2022-03-03 18:40:34,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default1]:[2022-03-03 18:40:34,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default6]:[2022-03-03 18:40:34,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default2]:[2022-03-03 18:40:34,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default0]:[2022-03-03 18:40:34,223] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default3]:[2022-03-03 18:40:34,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default5]:[2022-03-03 18:40:34,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default7]:[2022-03-03 18:40:34,307] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default6]:[2022-03-03 18:40:34,441] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default3]:[2022-03-03 18:40:34,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default7]:[2022-03-03 18:40:34,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default7]:[2022-03-03 18:40:34,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default6]:[2022-03-03 18:40:34,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default7]:[2022-03-03 18:40:34,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default2]:[2022-03-03 18:40:34,562] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default7]:[2022-03-03 18:40:34,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default0]:[2022-03-03 18:40:34,724] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default6]:[2022-03-03 18:40:34,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default1]:[2022-03-03 18:40:34,786] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default7]:[2022-03-03 18:40:34,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default3]:[2022-03-03 18:40:34,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default2]:[2022-03-03 18:40:34,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default5]:[2022-03-03 18:40:34,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default4]:[2022-03-03 18:40:34,914] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default3]:[2022-03-03 18:40:34,950] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default0]:[2022-03-03 18:40:34,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default0]:[2022-03-03 18:40:35,061] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default2]:[2022-03-03 18:40:35,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default2]:[2022-03-03 18:40:35,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default3]:[2022-03-03 18:40:35,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default7]:[2022-03-03 18:40:35,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default0]:[2022-03-03 18:40:35,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default2]:[2022-03-03 18:40:35,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default5]:[2022-03-03 18:40:35,320] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default2]:[2022-03-03 18:40:35,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default4]:[2022-03-03 18:40:35,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default5]:[2022-03-03 18:40:35,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default1]:[2022-03-03 18:40:35,419] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default0]:[2022-03-03 18:40:35,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default5]:[2022-03-03 18:40:35,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default6]:[2022-03-03 18:40:35,516] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default6]:[2022-03-03 18:40:35,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default7]:[2022-03-03 18:40:35,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default1]:[2022-03-03 18:40:35,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default2]:[2022-03-03 18:40:35,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default2]:[2022-03-03 18:40:35,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default3]:[2022-03-03 18:40:35,702] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default3]:[2022-03-03 18:40:35,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default0]:[2022-03-03 18:40:35,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default5]:[2022-03-03 18:40:35,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default3]:[2022-03-03 18:40:36,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default2]:[2022-03-03 18:40:36,027] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default3]:[2022-03-03 18:40:36,138] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default2]:[2022-03-03 18:40:36,102] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default4]:[2022-03-03 18:40:36,126] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default3]:[2022-03-03 18:40:36,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default0]:[2022-03-03 18:40:36,208] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default7]:[2022-03-03 18:40:36,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default2]:[2022-03-03 18:40:36,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default1]:[2022-03-03 18:40:36,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default7]:[2022-03-03 18:40:36,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default3]:[2022-03-03 18:40:36,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default7]:[2022-03-03 18:40:36,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default5]:[2022-03-03 18:40:36,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default2]:[2022-03-03 18:40:36,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default4]:[2022-03-03 18:40:36,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default6]:[2022-03-03 18:40:36,361] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default0]:[2022-03-03 18:40:36,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default5]:[2022-03-03 18:40:36,442] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default6]:[2022-03-03 18:40:36,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default3]:[2022-03-03 18:40:36,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default7]:[2022-03-03 18:40:36,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default5]:[2022-03-03 18:40:36,644] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default3]:[2022-03-03 18:40:36,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default4]:[2022-03-03 18:40:36,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default4]:[2022-03-03 18:40:36,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default1]:[2022-03-03 18:40:36,776] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default0]:[2022-03-03 18:40:36,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default3]:[2022-03-03 18:40:36,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default0]:[2022-03-03 18:40:36,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default4]:[2022-03-03 18:40:36,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default1]:[2022-03-03 18:40:36,842] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default6]:[2022-03-03 18:40:36,885] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default5]:[2022-03-03 18:40:36,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default6]:[2022-03-03 18:40:36,989] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default6]:[2022-03-03 18:40:36,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default4]:[2022-03-03 18:40:36,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default7]:[2022-03-03 18:40:36,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default1]:[2022-03-03 18:40:36,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default1]:[2022-03-03 18:40:37,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default6]:[2022-03-03 18:40:37,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default0]:[2022-03-03 18:40:37,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default1]:[2022-03-03 18:40:37,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default4]:[2022-03-03 18:40:37,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default7]:[2022-03-03 18:40:37,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default6]:[2022-03-03 18:40:37,245] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default7]:[2022-03-03 18:40:37,315] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default1]:[2022-03-03 18:40:37,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default3]:[2022-03-03 18:40:37,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default5]:[2022-03-03 18:40:37,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default2]:[2022-03-03 18:40:37,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default2]:[2022-03-03 18:40:37,369] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default5]:[2022-03-03 18:40:37,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default4]:[2022-03-03 18:40:37,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default6]:[2022-03-03 18:40:37,609] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default4]:[2022-03-03 18:40:37,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default7]:[2022-03-03 18:40:37,627] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default5]:[2022-03-03 18:40:37,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default4]:[2022-03-03 18:40:37,665] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default2]:[2022-03-03 18:40:37,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default7]:[2022-03-03 18:40:37,710] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default5]:[2022-03-03 18:40:37,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default7]:[2022-03-03 18:40:37,893] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default3]:[2022-03-03 18:40:37,992] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default2]:[2022-03-03 18:40:37,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default5]:[2022-03-03 18:40:37,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default3]:[2022-03-03 18:40:37,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default4]:[2022-03-03 18:40:38,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default3]:[2022-03-03 18:40:38,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default0]:[2022-03-03 18:40:38,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default4]:[2022-03-03 18:40:38,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default2]:[2022-03-03 18:40:38,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default3]:[2022-03-03 18:40:38,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default1]:[2022-03-03 18:40:38,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default1]:[2022-03-03 18:40:38,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default0]:[2022-03-03 18:40:38,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default4]:[2022-03-03 18:40:38,747] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default0]:[2022-03-03 18:40:38,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default4]:[2022-03-03 18:40:38,780] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default7]:[2022-03-03 18:40:38,813] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default5]:[2022-03-03 18:40:38,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default5]:[2022-03-03 18:40:38,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default6]:[2022-03-03 18:40:39,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default6]:[2022-03-03 18:40:39,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default6]:[2022-03-03 18:40:39,043] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default7]:[2022-03-03 18:40:39,154] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default4]:[2022-03-03 18:40:39,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default5]:[2022-03-03 18:40:39,343] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default0]:[2022-03-03 18:40:39,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default7]:[2022-03-03 18:40:39,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default1]:[2022-03-03 18:40:39,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default6]:[2022-03-03 18:40:39,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default4]:[2022-03-03 18:40:39,910] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default4]:[2022-03-03 18:40:40,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default5]:[2022-03-03 18:40:40,022] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default7]:[2022-03-03 18:40:40,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default5]:[2022-03-03 18:40:40,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default5]:[2022-03-03 18:40:41,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default7]:[2022-03-03 18:40:41,123] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default7]:[2022-03-03 18:40:41,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default6]:[2022-03-03 18:40:41,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default6]:[2022-03-03 18:40:41,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default3]:[2022-03-03 18:40:42,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default2]:[2022-03-03 18:40:42,546] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default0]:[2022-03-03 18:40:44,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default7]:time (ms) | save-checkpoint: 42368.40 [default1]:[2022-03-03 18:40:44,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default0]: successfully saved checkpoint at iteration 3000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]: iteration 3001/ 128728 | consumed samples: 48016 | consumed tokens: 98336768 | elapsed time per iteration (s): 77.11 | learning rate: 1.573E-05 | global batch size: 16 | lm loss: 5.976147E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.208 | TFLOPs: 1.59 | [default7]: iteration 3002/ 128728 | consumed samples: 48032 | consumed tokens: 98369536 | elapsed time per iteration (s): 15.17 | learning rate: 1.574E-05 | global batch size: 16 | lm loss: 5.967981E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3003/ 128728 | 
consumed samples: 48048 | consumed tokens: 98402304 | elapsed time per iteration (s): 15.19 | learning rate: 1.574E-05 | global batch size: 16 | lm loss: 5.914820E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3004/ 128728 | consumed samples: 48064 | consumed tokens: 98435072 | elapsed time per iteration (s): 15.24 | learning rate: 1.575E-05 | global batch size: 16 | lm loss: 5.897120E+00 | grad norm: 0.624 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3005/ 128728 | consumed samples: 48080 | consumed tokens: 98467840 | elapsed time per iteration (s): 15.25 | learning rate: 1.575E-05 | global batch size: 16 | lm loss: 5.955826E+00 | grad norm: 1.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 3006/ 128728 | consumed samples: 48096 | consumed tokens: 98500608 | elapsed time per iteration (s): 15.23 | learning rate: 1.576E-05 | global batch size: 16 | lm loss: 5.987964E+00 | grad norm: 0.892 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3007/ 128728 | consumed samples: 48112 | consumed tokens: 98533376 | elapsed time per iteration (s): 15.20 | learning rate: 1.577E-05 | global batch size: 16 | lm loss: 5.960895E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3008/ 128728 | consumed samples: 48128 | consumed tokens: 98566144 | elapsed time per iteration (s): 15.21 | learning rate: 1.577E-05 | global batch size: 16 | lm loss: 5.917996E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3009/ 128728 | consumed samples: 48144 | consumed tokens: 98598912 | elapsed time per iteration (s): 15.17 | learning rate: 1.578E-05 | global batch size: 16 | lm loss: 5.884965E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3010/ 128728 | consumed samples: 48160 | consumed tokens: 98631680 | elapsed time per iteration (s): 15.23 | learning rate: 1.578E-05 | global batch size: 16 | lm loss: 5.855898E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3011/ 128728 | consumed samples: 48176 | consumed tokens: 98664448 | elapsed time per iteration (s): 15.19 | learning rate: 1.579E-05 | global batch size: 16 | lm loss: 6.119870E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3012/ 128728 | consumed samples: 48192 | consumed tokens: 98697216 | elapsed time per iteration (s): 15.17 | learning rate: 1.579E-05 | global batch size: 16 | lm loss: 6.020233E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3013/ 128728 | consumed samples: 48208 | consumed tokens: 98729984 | elapsed time per iteration (s): 15.24 | learning rate: 1.580E-05 | global batch size: 16 | lm loss: 6.022349E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3014/ 128728 | consumed samples: 48224 | consumed tokens: 98762752 | elapsed time per iteration (s): 15.20 | learning rate: 1.580E-05 | global batch size: 16 | lm loss: 
5.822513E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3015/ 128728 | consumed samples: 48240 | consumed tokens: 98795520 | elapsed time per iteration (s): 15.22 | learning rate: 1.581E-05 | global batch size: 16 | lm loss: 5.816571E+00 | grad norm: 0.653 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3016/ 128728 | consumed samples: 48256 | consumed tokens: 98828288 | elapsed time per iteration (s): 15.22 | learning rate: 1.581E-05 | global batch size: 16 | lm loss: 5.939666E+00 | grad norm: 0.624 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3017/ 128728 | consumed samples: 48272 | consumed tokens: 98861056 | elapsed time per iteration (s): 15.23 | learning rate: 1.582E-05 | global batch size: 16 | lm loss: 5.874893E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3018/ 128728 | consumed samples: 48288 | consumed tokens: 98893824 | elapsed time per iteration (s): 15.21 | learning rate: 1.582E-05 | global batch size: 16 | lm loss: 6.413375E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3019/ 128728 | consumed samples: 48304 | consumed tokens: 98926592 | elapsed time per iteration (s): 15.23 | learning rate: 1.583E-05 | global batch size: 16 | lm loss: 5.910774E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3020/ 128728 | consumed samples: 48320 | consumed tokens: 98959360 | 
elapsed time per iteration (s): 15.28 | learning rate: 1.583E-05 | global batch size: 16 | lm loss: 5.987436E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 3021/ 128728 | consumed samples: 48336 | consumed tokens: 98992128 | elapsed time per iteration (s): 15.20 | learning rate: 1.584E-05 | global batch size: 16 | lm loss: 5.816168E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3022/ 128728 | consumed samples: 48352 | consumed tokens: 99024896 | elapsed time per iteration (s): 15.21 | learning rate: 1.584E-05 | global batch size: 16 | lm loss: 6.000154E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3023/ 128728 | consumed samples: 48368 | consumed tokens: 99057664 | elapsed time per iteration (s): 15.22 | learning rate: 1.585E-05 | global batch size: 16 | lm loss: 6.203218E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3024/ 128728 | consumed samples: 48384 | consumed tokens: 99090432 | elapsed time per iteration (s): 15.23 | learning rate: 1.585E-05 | global batch size: 16 | lm loss: 5.741538E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3025/ 128728 | consumed samples: 48400 | consumed tokens: 99123200 | elapsed time per iteration (s): 15.22 | learning rate: 1.586E-05 | global batch size: 16 | lm loss: 6.002611E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 
8.05 | [default7]: iteration 3026/ 128728 | consumed samples: 48416 | consumed tokens: 99155968 | elapsed time per iteration (s): 15.23 | learning rate: 1.586E-05 | global batch size: 16 | lm loss: 5.864077E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3027/ 128728 | consumed samples: 48432 | consumed tokens: 99188736 | elapsed time per iteration (s): 15.22 | learning rate: 1.587E-05 | global batch size: 16 | lm loss: 5.858949E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3028/ 128728 | consumed samples: 48448 | consumed tokens: 99221504 | elapsed time per iteration (s): 15.23 | learning rate: 1.588E-05 | global batch size: 16 | lm loss: 5.833308E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3029/ 128728 | consumed samples: 48464 | consumed tokens: 99254272 | elapsed time per iteration (s): 15.19 | learning rate: 1.588E-05 | global batch size: 16 | lm loss: 6.036957E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3030/ 128728 | consumed samples: 48480 | consumed tokens: 99287040 | elapsed time per iteration (s): 15.22 | learning rate: 1.589E-05 | global batch size: 16 | lm loss: 5.693832E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3031/ 128728 | consumed samples: 48496 | consumed tokens: 99319808 | elapsed time per iteration (s): 15.23 | learning rate: 1.589E-05 | global batch size: 16 | lm loss: 6.020626E+00 | grad norm: 0.854 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3032/ 128728 | consumed samples: 48512 | consumed tokens: 99352576 | elapsed time per iteration (s): 15.23 | learning rate: 1.590E-05 | global batch size: 16 | lm loss: 5.864520E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3033/ 128728 | consumed samples: 48528 | consumed tokens: 99385344 | elapsed time per iteration (s): 15.23 | learning rate: 1.590E-05 | global batch size: 16 | lm loss: 5.856801E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3034/ 128728 | consumed samples: 48544 | consumed tokens: 99418112 | elapsed time per iteration (s): 15.22 | learning rate: 1.591E-05 | global batch size: 16 | lm loss: 5.953742E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3035/ 128728 | consumed samples: 48560 | consumed tokens: 99450880 | elapsed time per iteration (s): 15.24 | learning rate: 1.591E-05 | global batch size: 16 | lm loss: 5.934213E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3036/ 128728 | consumed samples: 48576 | consumed tokens: 99483648 | elapsed time per iteration (s): 15.26 | learning rate: 1.592E-05 | global batch size: 16 | lm loss: 5.850968E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3037/ 128728 | consumed samples: 48592 | consumed tokens: 99516416 | elapsed time per iteration (s): 15.22 | learning rate: 
1.592E-05 | global batch size: 16 | lm loss: 6.049872E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3038/ 128728 | consumed samples: 48608 | consumed tokens: 99549184 | elapsed time per iteration (s): 15.21 | learning rate: 1.593E-05 | global batch size: 16 | lm loss: 5.903430E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3039/ 128728 | consumed samples: 48624 | consumed tokens: 99581952 | elapsed time per iteration (s): 15.24 | learning rate: 1.593E-05 | global batch size: 16 | lm loss: 6.003817E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3040/ 128728 | consumed samples: 48640 | consumed tokens: 99614720 | elapsed time per iteration (s): 15.25 | learning rate: 1.594E-05 | global batch size: 16 | lm loss: 5.985853E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3041/ 128728 | consumed samples: 48656 | consumed tokens: 99647488 | elapsed time per iteration (s): 15.24 | learning rate: 1.594E-05 | global batch size: 16 | lm loss: 5.714824E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3042/ 128728 | consumed samples: 48672 | consumed tokens: 99680256 | elapsed time per iteration (s): 15.24 | learning rate: 1.595E-05 | global batch size: 16 | lm loss: 6.073945E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3043/ 128728 | consumed 
samples: 48688 | consumed tokens: 99713024 | elapsed time per iteration (s): 15.23 | learning rate: 1.595E-05 | global batch size: 16 | lm loss: 5.912009E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3044/ 128728 | consumed samples: 48704 | consumed tokens: 99745792 | elapsed time per iteration (s): 15.22 | learning rate: 1.596E-05 | global batch size: 16 | lm loss: 5.936331E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3045/ 128728 | consumed samples: 48720 | consumed tokens: 99778560 | elapsed time per iteration (s): 15.29 | learning rate: 1.596E-05 | global batch size: 16 | lm loss: 5.901987E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 3046/ 128728 | consumed samples: 48736 | consumed tokens: 99811328 | elapsed time per iteration (s): 15.23 | learning rate: 1.597E-05 | global batch size: 16 | lm loss: 5.832729E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3047/ 128728 | consumed samples: 48752 | consumed tokens: 99844096 | elapsed time per iteration (s): 15.20 | learning rate: 1.598E-05 | global batch size: 16 | lm loss: 6.031357E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3048/ 128728 | consumed samples: 48768 | consumed tokens: 99876864 | elapsed time per iteration (s): 15.25 | learning rate: 1.598E-05 | global batch size: 16 | lm loss: 5.672740E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3049/ 128728 | consumed samples: 48784 | consumed tokens: 99909632 | elapsed time per iteration (s): 15.28 | learning rate: 1.599E-05 | global batch size: 16 | lm loss: 6.076912E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 3050/ 128728 | consumed samples: 48800 | consumed tokens: 99942400 | elapsed time per iteration (s): 15.20 | learning rate: 1.599E-05 | global batch size: 16 | lm loss: 5.738910E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3051/ 128728 | consumed samples: 48816 | consumed tokens: 99975168 | elapsed time per iteration (s): 15.28 | learning rate: 1.600E-05 | global batch size: 16 | lm loss: 5.781271E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 3052/ 128728 | consumed samples: 48832 | consumed tokens: 100007936 | elapsed time per iteration (s): 15.20 | learning rate: 1.600E-05 | global batch size: 16 | lm loss: 5.867689E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3053/ 128728 | consumed samples: 48848 | consumed tokens: 100040704 | elapsed time per iteration (s): 15.24 | learning rate: 1.601E-05 | global batch size: 16 | lm loss: 5.961505E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3054/ 128728 | consumed samples: 48864 | consumed tokens: 100073472 | elapsed time per iteration (s): 15.23 | learning rate: 1.601E-05 | global batch size: 16 | lm loss: 
6.001435E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3055/ 128728 | consumed samples: 48880 | consumed tokens: 100106240 | elapsed time per iteration (s): 15.25 | learning rate: 1.602E-05 | global batch size: 16 | lm loss: 5.903691E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3056/ 128728 | consumed samples: 48896 | consumed tokens: 100139008 | elapsed time per iteration (s): 15.22 | learning rate: 1.602E-05 | global batch size: 16 | lm loss: 5.782066E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3057/ 128728 | consumed samples: 48912 | consumed tokens: 100171776 | elapsed time per iteration (s): 15.26 | learning rate: 1.603E-05 | global batch size: 16 | lm loss: 5.891513E+00 | grad norm: 1.014 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3058/ 128728 | consumed samples: 48928 | consumed tokens: 100204544 | elapsed time per iteration (s): 15.22 | learning rate: 1.603E-05 | global batch size: 16 | lm loss: 5.959929E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3059/ 128728 | consumed samples: 48944 | consumed tokens: 100237312 | elapsed time per iteration (s): 15.19 | learning rate: 1.604E-05 | global batch size: 16 | lm loss: 5.808131E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3060/ 128728 | consumed samples: 48960 | consumed tokens: 100270080 | 
elapsed time per iteration (s): 15.23 | learning rate: 1.604E-05 | global batch size: 16 | lm loss: 5.985348E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3061/ 128728 | consumed samples: 48976 | consumed tokens: 100302848 | elapsed time per iteration (s): 15.23 | learning rate: 1.605E-05 | global batch size: 16 | lm loss: 5.834366E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3062/ 128728 | consumed samples: 48992 | consumed tokens: 100335616 | elapsed time per iteration (s): 15.23 | learning rate: 1.605E-05 | global batch size: 16 | lm loss: 5.852916E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3063/ 128728 | consumed samples: 49008 | consumed tokens: 100368384 | elapsed time per iteration (s): 15.24 | learning rate: 1.606E-05 | global batch size: 16 | lm loss: 6.065343E+00 | grad norm: 1.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3064/ 128728 | consumed samples: 49024 | consumed tokens: 100401152 | elapsed time per iteration (s): 15.21 | learning rate: 1.606E-05 | global batch size: 16 | lm loss: 5.798189E+00 | grad norm: 1.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3065/ 128728 | consumed samples: 49040 | consumed tokens: 100433920 | elapsed time per iteration (s): 15.20 | learning rate: 1.607E-05 | global batch size: 16 | lm loss: 5.934473E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | 
TFLOPs: 8.06 | [default7]: iteration 3066/ 128728 | consumed samples: 49056 | consumed tokens: 100466688 | elapsed time per iteration (s): 15.19 | learning rate: 1.607E-05 | global batch size: 16 | lm loss: 5.927220E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3067/ 128728 | consumed samples: 49072 | consumed tokens: 100499456 | elapsed time per iteration (s): 15.25 | learning rate: 1.608E-05 | global batch size: 16 | lm loss: 5.879972E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3068/ 128728 | consumed samples: 49088 | consumed tokens: 100532224 | elapsed time per iteration (s): 15.23 | learning rate: 1.609E-05 | global batch size: 16 | lm loss: 5.720819E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3069/ 128728 | consumed samples: 49104 | consumed tokens: 100564992 | elapsed time per iteration (s): 15.22 | learning rate: 1.609E-05 | global batch size: 16 | lm loss: 6.000784E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3070/ 128728 | consumed samples: 49120 | consumed tokens: 100597760 | elapsed time per iteration (s): 15.23 | learning rate: 1.610E-05 | global batch size: 16 | lm loss: 5.922574E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3071/ 128728 | consumed samples: 49136 | consumed tokens: 100630528 | elapsed time per iteration (s): 15.21 | learning rate: 1.610E-05 | global batch size: 16 | lm loss: 5.932800E+00 | grad norm: 0.945 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3072/ 128728 | consumed samples: 49152 | consumed tokens: 100663296 | elapsed time per iteration (s): 15.24 | learning rate: 1.611E-05 | global batch size: 16 | lm loss: 5.778855E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3073/ 128728 | consumed samples: 49168 | consumed tokens: 100696064 | elapsed time per iteration (s): 15.20 | learning rate: 1.611E-05 | global batch size: 16 | lm loss: 5.972422E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3074/ 128728 | consumed samples: 49184 | consumed tokens: 100728832 | elapsed time per iteration (s): 15.22 | learning rate: 1.612E-05 | global batch size: 16 | lm loss: 5.960331E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3075/ 128728 | consumed samples: 49200 | consumed tokens: 100761600 | elapsed time per iteration (s): 15.22 | learning rate: 1.612E-05 | global batch size: 16 | lm loss: 5.690085E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3076/ 128728 | consumed samples: 49216 | consumed tokens: 100794368 | elapsed time per iteration (s): 15.23 | learning rate: 1.613E-05 | global batch size: 16 | lm loss: 5.920603E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3077/ 128728 | consumed samples: 49232 | consumed tokens: 100827136 | elapsed time per iteration (s): 15.24 
| learning rate: 1.613E-05 | global batch size: 16 | lm loss: 6.182066E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3078/ 128728 | consumed samples: 49248 | consumed tokens: 100859904 | elapsed time per iteration (s): 15.25 | learning rate: 1.614E-05 | global batch size: 16 | lm loss: 5.818954E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3079/ 128728 | consumed samples: 49264 | consumed tokens: 100892672 | elapsed time per iteration (s): 15.20 | learning rate: 1.614E-05 | global batch size: 16 | lm loss: 5.869929E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3080/ 128728 | consumed samples: 49280 | consumed tokens: 100925440 | elapsed time per iteration (s): 15.23 | learning rate: 1.615E-05 | global batch size: 16 | lm loss: 5.978646E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3081/ 128728 | consumed samples: 49296 | consumed tokens: 100958208 | elapsed time per iteration (s): 15.21 | learning rate: 1.615E-05 | global batch size: 16 | lm loss: 5.753775E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3082/ 128728 | consumed samples: 49312 | consumed tokens: 100990976 | elapsed time per iteration (s): 15.23 | learning rate: 1.616E-05 | global batch size: 16 | lm loss: 5.812270E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3083/ 
128728 | consumed samples: 49328 | consumed tokens: 101023744 | elapsed time per iteration (s): 15.20 | learning rate: 1.616E-05 | global batch size: 16 | lm loss: 5.786464E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3084/ 128728 | consumed samples: 49344 | consumed tokens: 101056512 | elapsed time per iteration (s): 15.24 | learning rate: 1.617E-05 | global batch size: 16 | lm loss: 5.646963E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3085/ 128728 | consumed samples: 49360 | consumed tokens: 101089280 | elapsed time per iteration (s): 15.22 | learning rate: 1.617E-05 | global batch size: 16 | lm loss: 6.141891E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3086/ 128728 | consumed samples: 49376 | consumed tokens: 101122048 | elapsed time per iteration (s): 15.23 | learning rate: 1.618E-05 | global batch size: 16 | lm loss: 5.876431E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3087/ 128728 | consumed samples: 49392 | consumed tokens: 101154816 | elapsed time per iteration (s): 15.22 | learning rate: 1.618E-05 | global batch size: 16 | lm loss: 5.696089E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3088/ 128728 | consumed samples: 49408 | consumed tokens: 101187584 | elapsed time per iteration (s): 15.23 | learning rate: 1.619E-05 | global batch size: 16 | lm loss: 5.823549E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3089/ 128728 | consumed samples: 49424 | consumed tokens: 101220352 | elapsed time per iteration (s): 15.22 | learning rate: 1.620E-05 | global batch size: 16 | lm loss: 5.682597E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3090/ 128728 | consumed samples: 49440 | consumed tokens: 101253120 | elapsed time per iteration (s): 15.22 | learning rate: 1.620E-05 | global batch size: 16 | lm loss: 5.883008E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3091/ 128728 | consumed samples: 49456 | consumed tokens: 101285888 | elapsed time per iteration (s): 15.21 | learning rate: 1.621E-05 | global batch size: 16 | lm loss: 5.790089E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3092/ 128728 | consumed samples: 49472 | consumed tokens: 101318656 | elapsed time per iteration (s): 15.22 | learning rate: 1.621E-05 | global batch size: 16 | lm loss: 6.044188E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3093/ 128728 | consumed samples: 49488 | consumed tokens: 101351424 | elapsed time per iteration (s): 15.24 | learning rate: 1.622E-05 | global batch size: 16 | lm loss: 5.811264E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3094/ 128728 | consumed samples: 49504 | consumed tokens: 101384192 | elapsed time per iteration (s): 15.23 | learning rate: 1.622E-05 | global batch 
size: 16 | lm loss: 5.842374E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3095/ 128728 | consumed samples: 49520 | consumed tokens: 101416960 | elapsed time per iteration (s): 15.19 | learning rate: 1.623E-05 | global batch size: 16 | lm loss: 5.868669E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3096/ 128728 | consumed samples: 49536 | consumed tokens: 101449728 | elapsed time per iteration (s): 15.18 | learning rate: 1.623E-05 | global batch size: 16 | lm loss: 5.716575E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3097/ 128728 | consumed samples: 49552 | consumed tokens: 101482496 | elapsed time per iteration (s): 15.17 | learning rate: 1.624E-05 | global batch size: 16 | lm loss: 5.883733E+00 | grad norm: 0.672 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3098/ 128728 | consumed samples: 49568 | consumed tokens: 101515264 | elapsed time per iteration (s): 15.22 | learning rate: 1.624E-05 | global batch size: 16 | lm loss: 5.890719E+00 | grad norm: 1.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3099/ 128728 | consumed samples: 49584 | consumed tokens: 101548032 | elapsed time per iteration (s): 15.23 | learning rate: 1.625E-05 | global batch size: 16 | lm loss: 5.930487E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3100/ 128728 | consumed samples: 49600 | consumed 
tokens: 101580800 | elapsed time per iteration (s): 15.23 | learning rate: 1.625E-05 | global batch size: 16 | lm loss: 5.982317E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3101/ 128728 | consumed samples: 49616 | consumed tokens: 101613568 | elapsed time per iteration (s): 15.24 | learning rate: 1.626E-05 | global batch size: 16 | lm loss: 5.826386E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3102/ 128728 | consumed samples: 49632 | consumed tokens: 101646336 | elapsed time per iteration (s): 15.23 | learning rate: 1.626E-05 | global batch size: 16 | lm loss: 5.526955E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3103/ 128728 | consumed samples: 49648 | consumed tokens: 101679104 | elapsed time per iteration (s): 15.24 | learning rate: 1.627E-05 | global batch size: 16 | lm loss: 5.959418E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3104/ 128728 | consumed samples: 49664 | consumed tokens: 101711872 | elapsed time per iteration (s): 15.22 | learning rate: 1.627E-05 | global batch size: 16 | lm loss: 5.816753E+00 | grad norm: 2.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3105/ 128728 | consumed samples: 49680 | consumed tokens: 101744640 | elapsed time per iteration (s): 15.24 | learning rate: 1.628E-05 | global batch size: 16 | lm loss: 5.825230E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3106/ 128728 | consumed samples: 49696 | consumed tokens: 101777408 | elapsed time per iteration (s): 15.23 | learning rate: 1.628E-05 | global batch size: 16 | lm loss: 6.096361E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3107/ 128728 | consumed samples: 49712 | consumed tokens: 101810176 | elapsed time per iteration (s): 15.23 | learning rate: 1.629E-05 | global batch size: 16 | lm loss: 5.705378E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3108/ 128728 | consumed samples: 49728 | consumed tokens: 101842944 | elapsed time per iteration (s): 15.20 | learning rate: 1.629E-05 | global batch size: 16 | lm loss: 5.947734E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3109/ 128728 | consumed samples: 49744 | consumed tokens: 101875712 | elapsed time per iteration (s): 15.24 | learning rate: 1.630E-05 | global batch size: 16 | lm loss: 5.886482E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3110/ 128728 | consumed samples: 49760 | consumed tokens: 101908480 | elapsed time per iteration (s): 15.17 | learning rate: 1.631E-05 | global batch size: 16 | lm loss: 5.945197E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3111/ 128728 | consumed samples: 49776 | consumed tokens: 101941248 | elapsed time per iteration (s): 15.15 | learning rate: 1.631E-05 | global batch size: 16 | lm loss: 5.768273E+00 | grad norm: 
0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3112/ 128728 | consumed samples: 49792 | consumed tokens: 101974016 | elapsed time per iteration (s): 15.16 | learning rate: 1.632E-05 | global batch size: 16 | lm loss: 5.848940E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3113/ 128728 | consumed samples: 49808 | consumed tokens: 102006784 | elapsed time per iteration (s): 15.18 | learning rate: 1.632E-05 | global batch size: 16 | lm loss: 5.794857E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3114/ 128728 | consumed samples: 49824 | consumed tokens: 102039552 | elapsed time per iteration (s): 15.20 | learning rate: 1.633E-05 | global batch size: 16 | lm loss: 5.761194E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3115/ 128728 | consumed samples: 49840 | consumed tokens: 102072320 | elapsed time per iteration (s): 15.23 | learning rate: 1.633E-05 | global batch size: 16 | lm loss: 5.966802E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3116/ 128728 | consumed samples: 49856 | consumed tokens: 102105088 | elapsed time per iteration (s): 15.23 | learning rate: 1.634E-05 | global batch size: 16 | lm loss: 5.814324E+00 | grad norm: 2.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3117/ 128728 | consumed samples: 49872 | consumed tokens: 102137856 | elapsed time per 
iteration (s): 15.23 | learning rate: 1.634E-05 | global batch size: 16 | lm loss: 5.953111E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3118/ 128728 | consumed samples: 49888 | consumed tokens: 102170624 | elapsed time per iteration (s): 15.19 | learning rate: 1.635E-05 | global batch size: 16 | lm loss: 5.790831E+00 | grad norm: 0.653 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3119/ 128728 | consumed samples: 49904 | consumed tokens: 102203392 | elapsed time per iteration (s): 15.16 | learning rate: 1.635E-05 | global batch size: 16 | lm loss: 5.866699E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3120/ 128728 | consumed samples: 49920 | consumed tokens: 102236160 | elapsed time per iteration (s): 15.23 | learning rate: 1.636E-05 | global batch size: 16 | lm loss: 5.997011E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3121/ 128728 | consumed samples: 49936 | consumed tokens: 102268928 | elapsed time per iteration (s): 15.19 | learning rate: 1.636E-05 | global batch size: 16 | lm loss: 5.930976E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3122/ 128728 | consumed samples: 49952 | consumed tokens: 102301696 | elapsed time per iteration (s): 15.17 | learning rate: 1.637E-05 | global batch size: 16 | lm loss: 5.875608E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | 
[default7]: iteration 3123/ 128728 | consumed samples: 49968 | consumed tokens: 102334464 | elapsed time per iteration (s): 15.16 | learning rate: 1.637E-05 | global batch size: 16 | lm loss: 5.796740E+00 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3124/ 128728 | consumed samples: 49984 | consumed tokens: 102367232 | elapsed time per iteration (s): 15.16 | learning rate: 1.638E-05 | global batch size: 16 | lm loss: 5.692341E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3125/ 128728 | consumed samples: 50000 | consumed tokens: 102400000 | elapsed time per iteration (s): 15.20 | learning rate: 1.638E-05 | global batch size: 16 | lm loss: 5.906222E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3126/ 128728 | consumed samples: 50016 | consumed tokens: 102432768 | elapsed time per iteration (s): 15.15 | learning rate: 1.639E-05 | global batch size: 16 | lm loss: 5.771677E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3127/ 128728 | consumed samples: 50032 | consumed tokens: 102465536 | elapsed time per iteration (s): 15.21 | learning rate: 1.639E-05 | global batch size: 16 | lm loss: 5.853363E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3128/ 128728 | consumed samples: 50048 | consumed tokens: 102498304 | elapsed time per iteration (s): 15.13 | learning rate: 1.640E-05 | global batch size: 16 | lm loss: 5.964828E+00 | grad norm: 0.750 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.10 | [default7]: iteration 3129/ 128728 | consumed samples: 50064 | consumed tokens: 102531072 | elapsed time per iteration (s): 15.21 | learning rate: 1.641E-05 | global batch size: 16 | lm loss: 5.986765E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3130/ 128728 | consumed samples: 50080 | consumed tokens: 102563840 | elapsed time per iteration (s): 15.13 | learning rate: 1.641E-05 | global batch size: 16 | lm loss: 5.758943E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 3131/ 128728 | consumed samples: 50096 | consumed tokens: 102596608 | elapsed time per iteration (s): 15.21 | learning rate: 1.642E-05 | global batch size: 16 | lm loss: 5.953258E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3132/ 128728 | consumed samples: 50112 | consumed tokens: 102629376 | elapsed time per iteration (s): 15.20 | learning rate: 1.642E-05 | global batch size: 16 | lm loss: 5.834059E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3133/ 128728 | consumed samples: 50128 | consumed tokens: 102662144 | elapsed time per iteration (s): 15.23 | learning rate: 1.643E-05 | global batch size: 16 | lm loss: 5.778453E+00 | grad norm: 0.640 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3134/ 128728 | consumed samples: 50144 | consumed tokens: 102694912 | elapsed time per iteration (s): 15.19 | learning rate: 
1.643E-05 | global batch size: 16 | lm loss: 5.798711E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3135/ 128728 | consumed samples: 50160 | consumed tokens: 102727680 | elapsed time per iteration (s): 15.21 | learning rate: 1.644E-05 | global batch size: 16 | lm loss: 5.807882E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3136/ 128728 | consumed samples: 50176 | consumed tokens: 102760448 | elapsed time per iteration (s): 15.14 | learning rate: 1.644E-05 | global batch size: 16 | lm loss: 5.784853E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3137/ 128728 | consumed samples: 50192 | consumed tokens: 102793216 | elapsed time per iteration (s): 15.15 | learning rate: 1.645E-05 | global batch size: 16 | lm loss: 5.705042E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3138/ 128728 | consumed samples: 50208 | consumed tokens: 102825984 | elapsed time per iteration (s): 15.20 | learning rate: 1.645E-05 | global batch size: 16 | lm loss: 5.907452E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3139/ 128728 | consumed samples: 50224 | consumed tokens: 102858752 | elapsed time per iteration (s): 15.13 | learning rate: 1.646E-05 | global batch size: 16 | lm loss: 6.042287E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.10 | [default7]: iteration 3140/ 128728 | consumed 
samples: 50240 | consumed tokens: 102891520 | elapsed time per iteration (s): 15.23 | learning rate: 1.646E-05 | global batch size: 16 | lm loss: 5.736620E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3141/ 128728 | consumed samples: 50256 | consumed tokens: 102924288 | elapsed time per iteration (s): 15.22 | learning rate: 1.647E-05 | global batch size: 16 | lm loss: 6.033116E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3142/ 128728 | consumed samples: 50272 | consumed tokens: 102957056 | elapsed time per iteration (s): 15.17 | learning rate: 1.647E-05 | global batch size: 16 | lm loss: 5.729618E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3143/ 128728 | consumed samples: 50288 | consumed tokens: 102989824 | elapsed time per iteration (s): 15.19 | learning rate: 1.648E-05 | global batch size: 16 | lm loss: 5.883410E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3144/ 128728 | consumed samples: 50304 | consumed tokens: 103022592 | elapsed time per iteration (s): 15.21 | learning rate: 1.648E-05 | global batch size: 16 | lm loss: 5.754305E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3145/ 128728 | consumed samples: 50320 | consumed tokens: 103055360 | elapsed time per iteration (s): 15.21 | learning rate: 1.649E-05 | global batch size: 16 | lm loss: 5.893435E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3146/ 128728 | consumed samples: 50336 | consumed tokens: 103088128 | elapsed time per iteration (s): 15.20 | learning rate: 1.649E-05 | global batch size: 16 | lm loss: 5.840903E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3147/ 128728 | consumed samples: 50352 | consumed tokens: 103120896 | elapsed time per iteration (s): 15.25 | learning rate: 1.650E-05 | global batch size: 16 | lm loss: 5.732727E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3148/ 128728 | consumed samples: 50368 | consumed tokens: 103153664 | elapsed time per iteration (s): 15.22 | learning rate: 1.650E-05 | global batch size: 16 | lm loss: 6.073945E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3149/ 128728 | consumed samples: 50384 | consumed tokens: 103186432 | elapsed time per iteration (s): 15.14 | learning rate: 1.651E-05 | global batch size: 16 | lm loss: 5.885465E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3150/ 128728 | consumed samples: 50400 | consumed tokens: 103219200 | elapsed time per iteration (s): 15.18 | learning rate: 1.652E-05 | global batch size: 16 | lm loss: 5.783937E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3151/ 128728 | consumed samples: 50416 | consumed tokens: 103251968 | elapsed time per iteration (s): 15.15 | learning rate: 1.652E-05 | global batch size: 16 | lm 
loss: 5.913184E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3152/ 128728 | consumed samples: 50432 | consumed tokens: 103284736 | elapsed time per iteration (s): 15.22 | learning rate: 1.653E-05 | global batch size: 16 | lm loss: 5.823668E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3153/ 128728 | consumed samples: 50448 | consumed tokens: 103317504 | elapsed time per iteration (s): 15.14 | learning rate: 1.653E-05 | global batch size: 16 | lm loss: 5.867479E+00 | grad norm: 0.936 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3154/ 128728 | consumed samples: 50464 | consumed tokens: 103350272 | elapsed time per iteration (s): 15.16 | learning rate: 1.654E-05 | global batch size: 16 | lm loss: 5.772203E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3155/ 128728 | consumed samples: 50480 | consumed tokens: 103383040 | elapsed time per iteration (s): 15.16 | learning rate: 1.654E-05 | global batch size: 16 | lm loss: 5.679266E+00 | grad norm: 1.483 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3156/ 128728 | consumed samples: 50496 | consumed tokens: 103415808 | elapsed time per iteration (s): 15.22 | learning rate: 1.655E-05 | global batch size: 16 | lm loss: 5.798676E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3157/ 128728 | consumed samples: 50512 | consumed tokens: 
103448576 | elapsed time per iteration (s): 15.16 | learning rate: 1.655E-05 | global batch size: 16 | lm loss: 5.913177E+00 | grad norm: 1.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3158/ 128728 | consumed samples: 50528 | consumed tokens: 103481344 | elapsed time per iteration (s): 15.17 | learning rate: 1.656E-05 | global batch size: 16 | lm loss: 5.806971E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3159/ 128728 | consumed samples: 50544 | consumed tokens: 103514112 | elapsed time per iteration (s): 15.24 | learning rate: 1.656E-05 | global batch size: 16 | lm loss: 5.890893E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3160/ 128728 | consumed samples: 50560 | consumed tokens: 103546880 | elapsed time per iteration (s): 15.22 | learning rate: 1.657E-05 | global batch size: 16 | lm loss: 5.810333E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3161/ 128728 | consumed samples: 50576 | consumed tokens: 103579648 | elapsed time per iteration (s): 15.14 | learning rate: 1.657E-05 | global batch size: 16 | lm loss: 5.901513E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3162/ 128728 | consumed samples: 50592 | consumed tokens: 103612416 | elapsed time per iteration (s): 15.22 | learning rate: 1.658E-05 | global batch size: 16 | lm loss: 5.824885E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.051 | TFLOPs: 8.05 | [default7]: iteration 3163/ 128728 | consumed samples: 50608 | consumed tokens: 103645184 | elapsed time per iteration (s): 15.20 | learning rate: 1.658E-05 | global batch size: 16 | lm loss: 5.806005E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3164/ 128728 | consumed samples: 50624 | consumed tokens: 103677952 | elapsed time per iteration (s): 15.21 | learning rate: 1.659E-05 | global batch size: 16 | lm loss: 5.998919E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3165/ 128728 | consumed samples: 50640 | consumed tokens: 103710720 | elapsed time per iteration (s): 15.20 | learning rate: 1.659E-05 | global batch size: 16 | lm loss: 5.667655E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3166/ 128728 | consumed samples: 50656 | consumed tokens: 103743488 | elapsed time per iteration (s): 15.20 | learning rate: 1.660E-05 | global batch size: 16 | lm loss: 5.927030E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3167/ 128728 | consumed samples: 50672 | consumed tokens: 103776256 | elapsed time per iteration (s): 15.22 | learning rate: 1.660E-05 | global batch size: 16 | lm loss: 5.922341E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3168/ 128728 | consumed samples: 50688 | consumed tokens: 103809024 | elapsed time per iteration (s): 15.20 | learning rate: 1.661E-05 | global batch size: 16 | lm loss: 5.802799E+00 | grad norm: 0.894 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3169/ 128728 | consumed samples: 50704 | consumed tokens: 103841792 | elapsed time per iteration (s): 15.20 | learning rate: 1.661E-05 | global batch size: 16 | lm loss: 5.817975E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3170/ 128728 | consumed samples: 50720 | consumed tokens: 103874560 | elapsed time per iteration (s): 15.18 | learning rate: 1.662E-05 | global batch size: 16 | lm loss: 6.009351E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3171/ 128728 | consumed samples: 50736 | consumed tokens: 103907328 | elapsed time per iteration (s): 15.19 | learning rate: 1.663E-05 | global batch size: 16 | lm loss: 5.650498E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3172/ 128728 | consumed samples: 50752 | consumed tokens: 103940096 | elapsed time per iteration (s): 15.17 | learning rate: 1.663E-05 | global batch size: 16 | lm loss: 5.935712E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3173/ 128728 | consumed samples: 50768 | consumed tokens: 103972864 | elapsed time per iteration (s): 15.22 | learning rate: 1.664E-05 | global batch size: 16 | lm loss: 5.931666E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3174/ 128728 | consumed samples: 50784 | consumed tokens: 104005632 | elapsed time per iteration (s): 
15.19 | learning rate: 1.664E-05 | global batch size: 16 | lm loss: 5.748640E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3175/ 128728 | consumed samples: 50800 | consumed tokens: 104038400 | elapsed time per iteration (s): 15.19 | learning rate: 1.665E-05 | global batch size: 16 | lm loss: 5.910668E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3176/ 128728 | consumed samples: 50816 | consumed tokens: 104071168 | elapsed time per iteration (s): 15.22 | learning rate: 1.665E-05 | global batch size: 16 | lm loss: 5.654323E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3177/ 128728 | consumed samples: 50832 | consumed tokens: 104103936 | elapsed time per iteration (s): 15.14 | learning rate: 1.666E-05 | global batch size: 16 | lm loss: 5.842155E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3178/ 128728 | consumed samples: 50848 | consumed tokens: 104136704 | elapsed time per iteration (s): 15.21 | learning rate: 1.666E-05 | global batch size: 16 | lm loss: 5.938166E+00 | grad norm: 0.997 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3179/ 128728 | consumed samples: 50864 | consumed tokens: 104169472 | elapsed time per iteration (s): 15.21 | learning rate: 1.667E-05 | global batch size: 16 | lm loss: 5.896249E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 
3180/ 128728 | consumed samples: 50880 | consumed tokens: 104202240 | elapsed time per iteration (s): 15.19 | learning rate: 1.667E-05 | global batch size: 16 | lm loss: 5.735763E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3181/ 128728 | consumed samples: 50896 | consumed tokens: 104235008 | elapsed time per iteration (s): 15.21 | learning rate: 1.668E-05 | global batch size: 16 | lm loss: 6.049779E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3182/ 128728 | consumed samples: 50912 | consumed tokens: 104267776 | elapsed time per iteration (s): 15.21 | learning rate: 1.668E-05 | global batch size: 16 | lm loss: 5.771222E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3183/ 128728 | consumed samples: 50928 | consumed tokens: 104300544 | elapsed time per iteration (s): 15.21 | learning rate: 1.669E-05 | global batch size: 16 | lm loss: 6.001236E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3184/ 128728 | consumed samples: 50944 | consumed tokens: 104333312 | elapsed time per iteration (s): 15.20 | learning rate: 1.669E-05 | global batch size: 16 | lm loss: 5.754385E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3185/ 128728 | consumed samples: 50960 | consumed tokens: 104366080 | elapsed time per iteration (s): 15.21 | learning rate: 1.670E-05 | global batch size: 16 | lm loss: 5.952290E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3186/ 128728 | consumed samples: 50976 | consumed tokens: 104398848 | elapsed time per iteration (s): 15.22 | learning rate: 1.670E-05 | global batch size: 16 | lm loss: 5.944228E+00 | grad norm: 1.274 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3187/ 128728 | consumed samples: 50992 | consumed tokens: 104431616 | elapsed time per iteration (s): 15.18 | learning rate: 1.671E-05 | global batch size: 16 | lm loss: 5.856114E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3188/ 128728 | consumed samples: 51008 | consumed tokens: 104464384 | elapsed time per iteration (s): 15.21 | learning rate: 1.671E-05 | global batch size: 16 | lm loss: 5.799392E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3189/ 128728 | consumed samples: 51024 | consumed tokens: 104497152 | elapsed time per iteration (s): 15.18 | learning rate: 1.672E-05 | global batch size: 16 | lm loss: 5.693764E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3190/ 128728 | consumed samples: 51040 | consumed tokens: 104529920 | elapsed time per iteration (s): 15.22 | learning rate: 1.672E-05 | global batch size: 16 | lm loss: 5.993411E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3191/ 128728 | consumed samples: 51056 | consumed tokens: 104562688 | elapsed time per iteration (s): 15.20 | learning rate: 1.673E-05 | 
global batch size: 16 | lm loss: 5.842443E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3192/ 128728 | consumed samples: 51072 | consumed tokens: 104595456 | elapsed time per iteration (s): 15.22 | learning rate: 1.674E-05 | global batch size: 16 | lm loss: 5.879288E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3193/ 128728 | consumed samples: 51088 | consumed tokens: 104628224 | elapsed time per iteration (s): 15.21 | learning rate: 1.674E-05 | global batch size: 16 | lm loss: 5.917938E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3194/ 128728 | consumed samples: 51104 | consumed tokens: 104660992 | elapsed time per iteration (s): 15.20 | learning rate: 1.675E-05 | global batch size: 16 | lm loss: 5.804705E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3195/ 128728 | consumed samples: 51120 | consumed tokens: 104693760 | elapsed time per iteration (s): 15.18 | learning rate: 1.675E-05 | global batch size: 16 | lm loss: 5.770677E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3196/ 128728 | consumed samples: 51136 | consumed tokens: 104726528 | elapsed time per iteration (s): 15.22 | learning rate: 1.676E-05 | global batch size: 16 | lm loss: 5.813903E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3197/ 128728 | consumed samples: 
51152 | consumed tokens: 104759296 | elapsed time per iteration (s): 15.20 | learning rate: 1.676E-05 | global batch size: 16 | lm loss: 5.794953E+00 | grad norm: 1.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3198/ 128728 | consumed samples: 51168 | consumed tokens: 104792064 | elapsed time per iteration (s): 15.21 | learning rate: 1.677E-05 | global batch size: 16 | lm loss: 5.620133E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3199/ 128728 | consumed samples: 51184 | consumed tokens: 104824832 | elapsed time per iteration (s): 15.20 | learning rate: 1.677E-05 | global batch size: 16 | lm loss: 5.942338E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3200/ 128728 | consumed samples: 51200 | consumed tokens: 104857600 | elapsed time per iteration (s): 15.23 | learning rate: 1.678E-05 | global batch size: 16 | lm loss: 5.729494E+00 | grad norm: 0.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3201/ 128728 | consumed samples: 51216 | consumed tokens: 104890368 | elapsed time per iteration (s): 15.19 | learning rate: 1.678E-05 | global batch size: 16 | lm loss: 5.862929E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3202/ 128728 | consumed samples: 51232 | consumed tokens: 104923136 | elapsed time per iteration (s): 15.20 | learning rate: 1.679E-05 | global batch size: 16 | lm loss: 5.847036E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3203/ 128728 | consumed samples: 51248 | consumed tokens: 104955904 | elapsed time per iteration (s): 15.21 | learning rate: 1.679E-05 | global batch size: 16 | lm loss: 5.800924E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3204/ 128728 | consumed samples: 51264 | consumed tokens: 104988672 | elapsed time per iteration (s): 15.16 | learning rate: 1.680E-05 | global batch size: 16 | lm loss: 5.901340E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3205/ 128728 | consumed samples: 51280 | consumed tokens: 105021440 | elapsed time per iteration (s): 15.21 | learning rate: 1.680E-05 | global batch size: 16 | lm loss: 5.704348E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3206/ 128728 | consumed samples: 51296 | consumed tokens: 105054208 | elapsed time per iteration (s): 15.18 | learning rate: 1.681E-05 | global batch size: 16 | lm loss: 5.754029E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3207/ 128728 | consumed samples: 51312 | consumed tokens: 105086976 | elapsed time per iteration (s): 15.21 | learning rate: 1.681E-05 | global batch size: 16 | lm loss: 5.820123E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3208/ 128728 | consumed samples: 51328 | consumed tokens: 105119744 | elapsed time per iteration (s): 15.22 | learning rate: 1.682E-05 | global batch size: 16 | lm loss: 5.841055E+00 
| grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3209/ 128728 | consumed samples: 51344 | consumed tokens: 105152512 | elapsed time per iteration (s): 15.26 | learning rate: 1.682E-05 | global batch size: 16 | lm loss: 5.840108E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3210/ 128728 | consumed samples: 51360 | consumed tokens: 105185280 | elapsed time per iteration (s): 15.22 | learning rate: 1.683E-05 | global batch size: 16 | lm loss: 5.684037E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3211/ 128728 | consumed samples: 51376 | consumed tokens: 105218048 | elapsed time per iteration (s): 15.22 | learning rate: 1.683E-05 | global batch size: 16 | lm loss: 5.864146E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3212/ 128728 | consumed samples: 51392 | consumed tokens: 105250816 | elapsed time per iteration (s): 15.23 | learning rate: 1.684E-05 | global batch size: 16 | lm loss: 5.662052E+00 | grad norm: 1.047 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3213/ 128728 | consumed samples: 51408 | consumed tokens: 105283584 | elapsed time per iteration (s): 15.22 | learning rate: 1.685E-05 | global batch size: 16 | lm loss: 5.930824E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3214/ 128728 | consumed samples: 51424 | consumed tokens: 105316352 | elapsed time 
per iteration (s): 15.23 | learning rate: 1.685E-05 | global batch size: 16 | lm loss: 5.820041E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3215/ 128728 | consumed samples: 51440 | consumed tokens: 105349120 | elapsed time per iteration (s): 15.24 | learning rate: 1.686E-05 | global batch size: 16 | lm loss: 5.921219E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3216/ 128728 | consumed samples: 51456 | consumed tokens: 105381888 | elapsed time per iteration (s): 15.23 | learning rate: 1.686E-05 | global batch size: 16 | lm loss: 5.814280E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3217/ 128728 | consumed samples: 51472 | consumed tokens: 105414656 | elapsed time per iteration (s): 15.21 | learning rate: 1.687E-05 | global batch size: 16 | lm loss: 5.856856E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3218/ 128728 | consumed samples: 51488 | consumed tokens: 105447424 | elapsed time per iteration (s): 15.23 | learning rate: 1.687E-05 | global batch size: 16 | lm loss: 5.942042E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3219/ 128728 | consumed samples: 51504 | consumed tokens: 105480192 | elapsed time per iteration (s): 15.22 | learning rate: 1.688E-05 | global batch size: 16 | lm loss: 5.818819E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | 
[default7]: iteration 3220/ 128728 | consumed samples: 51520 | consumed tokens: 105512960 | elapsed time per iteration (s): 15.22 | learning rate: 1.688E-05 | global batch size: 16 | lm loss: 5.934455E+00 | grad norm: 1.517 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3221/ 128728 | consumed samples: 51536 | consumed tokens: 105545728 | elapsed time per iteration (s): 15.24 | learning rate: 1.689E-05 | global batch size: 16 | lm loss: 5.632852E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3222/ 128728 | consumed samples: 51552 | consumed tokens: 105578496 | elapsed time per iteration (s): 15.17 | learning rate: 1.689E-05 | global batch size: 16 | lm loss: 5.690525E+00 | grad norm: 0.658 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3223/ 128728 | consumed samples: 51568 | consumed tokens: 105611264 | elapsed time per iteration (s): 15.26 | learning rate: 1.690E-05 | global batch size: 16 | lm loss: 5.435367E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3224/ 128728 | consumed samples: 51584 | consumed tokens: 105644032 | elapsed time per iteration (s): 15.21 | learning rate: 1.690E-05 | global batch size: 16 | lm loss: 5.834442E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3225/ 128728 | consumed samples: 51600 | consumed tokens: 105676800 | elapsed time per iteration (s): 15.19 | learning rate: 1.691E-05 | global batch size: 16 | lm loss: 5.838341E+00 | grad norm: 0.693 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3226/ 128728 | consumed samples: 51616 | consumed tokens: 105709568 | elapsed time per iteration (s): 15.21 | learning rate: 1.691E-05 | global batch size: 16 | lm loss: 5.809447E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3227/ 128728 | consumed samples: 51632 | consumed tokens: 105742336 | elapsed time per iteration (s): 15.22 | learning rate: 1.692E-05 | global batch size: 16 | lm loss: 5.792805E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3228/ 128728 | consumed samples: 51648 | consumed tokens: 105775104 | elapsed time per iteration (s): 15.16 | learning rate: 1.692E-05 | global batch size: 16 | lm loss: 5.630265E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3229/ 128728 | consumed samples: 51664 | consumed tokens: 105807872 | elapsed time per iteration (s): 15.27 | learning rate: 1.693E-05 | global batch size: 16 | lm loss: 5.785818E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3230/ 128728 | consumed samples: 51680 | consumed tokens: 105840640 | elapsed time per iteration (s): 15.21 | learning rate: 1.693E-05 | global batch size: 16 | lm loss: 5.710336E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3231/ 128728 | consumed samples: 51696 | consumed tokens: 105873408 | elapsed time per iteration (s): 15.22 | learning rate: 
1.694E-05 | global batch size: 16 | lm loss: 5.774018E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3232/ 128728 | consumed samples: 51712 | consumed tokens: 105906176 | elapsed time per iteration (s): 15.21 | learning rate: 1.695E-05 | global batch size: 16 | lm loss: 5.810544E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3233/ 128728 | consumed samples: 51728 | consumed tokens: 105938944 | elapsed time per iteration (s): 15.21 | learning rate: 1.695E-05 | global batch size: 16 | lm loss: 5.686558E+00 | grad norm: 0.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3234/ 128728 | consumed samples: 51744 | consumed tokens: 105971712 | elapsed time per iteration (s): 15.22 | learning rate: 1.696E-05 | global batch size: 16 | lm loss: 5.808766E+00 | grad norm: 0.935 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3235/ 128728 | consumed samples: 51760 | consumed tokens: 106004480 | elapsed time per iteration (s): 15.23 | learning rate: 1.696E-05 | global batch size: 16 | lm loss: 5.933078E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3236/ 128728 | consumed samples: 51776 | consumed tokens: 106037248 | elapsed time per iteration (s): 15.20 | learning rate: 1.697E-05 | global batch size: 16 | lm loss: 5.929778E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3237/ 128728 | consumed 
samples: 51792 | consumed tokens: 106070016 | elapsed time per iteration (s): 15.19 | learning rate: 1.697E-05 | global batch size: 16 | lm loss: 5.637609E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3238/ 128728 | consumed samples: 51808 | consumed tokens: 106102784 | elapsed time per iteration (s): 15.21 | learning rate: 1.698E-05 | global batch size: 16 | lm loss: 5.857882E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3239/ 128728 | consumed samples: 51824 | consumed tokens: 106135552 | elapsed time per iteration (s): 15.24 | learning rate: 1.698E-05 | global batch size: 16 | lm loss: 5.865059E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3240/ 128728 | consumed samples: 51840 | consumed tokens: 106168320 | elapsed time per iteration (s): 15.18 | learning rate: 1.699E-05 | global batch size: 16 | lm loss: 5.716511E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3241/ 128728 | consumed samples: 51856 | consumed tokens: 106201088 | elapsed time per iteration (s): 15.21 | learning rate: 1.699E-05 | global batch size: 16 | lm loss: 5.803041E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3242/ 128728 | consumed samples: 51872 | consumed tokens: 106233856 | elapsed time per iteration (s): 15.18 | learning rate: 1.700E-05 | global batch size: 16 | lm loss: 5.904123E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3243/ 128728 | consumed samples: 51888 | consumed tokens: 106266624 | elapsed time per iteration (s): 15.20 | learning rate: 1.700E-05 | global batch size: 16 | lm loss: 5.810658E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3244/ 128728 | consumed samples: 51904 | consumed tokens: 106299392 | elapsed time per iteration (s): 15.23 | learning rate: 1.701E-05 | global batch size: 16 | lm loss: 5.841102E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3245/ 128728 | consumed samples: 51920 | consumed tokens: 106332160 | elapsed time per iteration (s): 15.22 | learning rate: 1.701E-05 | global batch size: 16 | lm loss: 5.736031E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3246/ 128728 | consumed samples: 51936 | consumed tokens: 106364928 | elapsed time per iteration (s): 15.19 | learning rate: 1.702E-05 | global batch size: 16 | lm loss: 5.761059E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3247/ 128728 | consumed samples: 51952 | consumed tokens: 106397696 | elapsed time per iteration (s): 15.24 | learning rate: 1.702E-05 | global batch size: 16 | lm loss: 5.894554E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3248/ 128728 | consumed samples: 51968 | consumed tokens: 106430464 | elapsed time per iteration (s): 15.21 | learning rate: 1.703E-05 | global batch size: 16 | lm 
loss: 5.798692E+00 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3249/ 128728 | consumed samples: 51984 | consumed tokens: 106463232 | elapsed time per iteration (s): 15.21 | learning rate: 1.703E-05 | global batch size: 16 | lm loss: 5.678707E+00 | grad norm: 1.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3250/ 128728 | consumed samples: 52000 | consumed tokens: 106496000 | elapsed time per iteration (s): 15.22 | learning rate: 1.704E-05 | global batch size: 16 | lm loss: 5.730203E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3251/ 128728 | consumed samples: 52016 | consumed tokens: 106528768 | elapsed time per iteration (s): 15.17 | learning rate: 1.704E-05 | global batch size: 16 | lm loss: 5.578306E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3252/ 128728 | consumed samples: 52032 | consumed tokens: 106561536 | elapsed time per iteration (s): 15.17 | learning rate: 1.705E-05 | global batch size: 16 | lm loss: 5.799627E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3253/ 128728 | consumed samples: 52048 | consumed tokens: 106594304 | elapsed time per iteration (s): 15.23 | learning rate: 1.706E-05 | global batch size: 16 | lm loss: 5.785791E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3254/ 128728 | consumed samples: 52064 | consumed tokens: 
106627072 | elapsed time per iteration (s): 15.21 | learning rate: 1.706E-05 | global batch size: 16 | lm loss: 5.783490E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3255/ 128728 | consumed samples: 52080 | consumed tokens: 106659840 | elapsed time per iteration (s): 15.16 | learning rate: 1.707E-05 | global batch size: 16 | lm loss: 5.852077E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3256/ 128728 | consumed samples: 52096 | consumed tokens: 106692608 | elapsed time per iteration (s): 15.27 | learning rate: 1.707E-05 | global batch size: 16 | lm loss: 6.013089E+00 | grad norm: 0.940 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3257/ 128728 | consumed samples: 52112 | consumed tokens: 106725376 | elapsed time per iteration (s): 15.19 | learning rate: 1.708E-05 | global batch size: 16 | lm loss: 5.874432E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3258/ 128728 | consumed samples: 52128 | consumed tokens: 106758144 | elapsed time per iteration (s): 15.23 | learning rate: 1.708E-05 | global batch size: 16 | lm loss: 5.740964E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3259/ 128728 | consumed samples: 52144 | consumed tokens: 106790912 | elapsed time per iteration (s): 15.24 | learning rate: 1.709E-05 | global batch size: 16 | lm loss: 5.593841E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.050 | TFLOPs: 8.04 | [default7]: iteration 3260/ 128728 | consumed samples: 52160 | consumed tokens: 106823680 | elapsed time per iteration (s): 15.20 | learning rate: 1.709E-05 | global batch size: 16 | lm loss: 5.618110E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3261/ 128728 | consumed samples: 52176 | consumed tokens: 106856448 | elapsed time per iteration (s): 15.20 | learning rate: 1.710E-05 | global batch size: 16 | lm loss: 5.856056E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3262/ 128728 | consumed samples: 52192 | consumed tokens: 106889216 | elapsed time per iteration (s): 15.22 | learning rate: 1.710E-05 | global batch size: 16 | lm loss: 5.848229E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3263/ 128728 | consumed samples: 52208 | consumed tokens: 106921984 | elapsed time per iteration (s): 15.23 | learning rate: 1.711E-05 | global batch size: 16 | lm loss: 5.979393E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3264/ 128728 | consumed samples: 52224 | consumed tokens: 106954752 | elapsed time per iteration (s): 15.22 | learning rate: 1.711E-05 | global batch size: 16 | lm loss: 5.691633E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3265/ 128728 | consumed samples: 52240 | consumed tokens: 106987520 | elapsed time per iteration (s): 15.23 | learning rate: 1.712E-05 | global batch size: 16 | lm loss: 5.807378E+00 | grad norm: 0.722 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3266/ 128728 | consumed samples: 52256 | consumed tokens: 107020288 | elapsed time per iteration (s): 15.23 | learning rate: 1.712E-05 | global batch size: 16 | lm loss: 5.705748E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3267/ 128728 | consumed samples: 52272 | consumed tokens: 107053056 | elapsed time per iteration (s): 15.25 | learning rate: 1.713E-05 | global batch size: 16 | lm loss: 5.653453E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3268/ 128728 | consumed samples: 52288 | consumed tokens: 107085824 | elapsed time per iteration (s): 15.23 | learning rate: 1.713E-05 | global batch size: 16 | lm loss: 5.936657E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3269/ 128728 | consumed samples: 52304 | consumed tokens: 107118592 | elapsed time per iteration (s): 15.20 | learning rate: 1.714E-05 | global batch size: 16 | lm loss: 5.710529E+00 | grad norm: 1.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3270/ 128728 | consumed samples: 52320 | consumed tokens: 107151360 | elapsed time per iteration (s): 15.17 | learning rate: 1.714E-05 | global batch size: 16 | lm loss: 5.695917E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3271/ 128728 | consumed samples: 52336 | consumed tokens: 107184128 | elapsed time per iteration (s): 
15.24 | learning rate: 1.715E-05 | global batch size: 16 | lm loss: 5.680094E+00 | grad norm: 0.953 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3272/ 128728 | consumed samples: 52352 | consumed tokens: 107216896 | elapsed time per iteration (s): 15.20 | learning rate: 1.715E-05 | global batch size: 16 | lm loss: 5.884488E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3273/ 128728 | consumed samples: 52368 | consumed tokens: 107249664 | elapsed time per iteration (s): 15.24 | learning rate: 1.716E-05 | global batch size: 16 | lm loss: 5.788309E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3274/ 128728 | consumed samples: 52384 | consumed tokens: 107282432 | elapsed time per iteration (s): 15.24 | learning rate: 1.717E-05 | global batch size: 16 | lm loss: 5.728816E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3275/ 128728 | consumed samples: 52400 | consumed tokens: 107315200 | elapsed time per iteration (s): 15.23 | learning rate: 1.717E-05 | global batch size: 16 | lm loss: 6.045995E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3276/ 128728 | consumed samples: 52416 | consumed tokens: 107347968 | elapsed time per iteration (s): 15.21 | learning rate: 1.718E-05 | global batch size: 16 | lm loss: 5.863292E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 
3277/ 128728 | consumed samples: 52432 | consumed tokens: 107380736 | elapsed time per iteration (s): 15.21 | learning rate: 1.718E-05 | global batch size: 16 | lm loss: 5.678338E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3278/ 128728 | consumed samples: 52448 | consumed tokens: 107413504 | elapsed time per iteration (s): 15.16 | learning rate: 1.719E-05 | global batch size: 16 | lm loss: 5.855639E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3279/ 128728 | consumed samples: 52464 | consumed tokens: 107446272 | elapsed time per iteration (s): 15.22 | learning rate: 1.719E-05 | global batch size: 16 | lm loss: 5.804471E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3280/ 128728 | consumed samples: 52480 | consumed tokens: 107479040 | elapsed time per iteration (s): 15.22 | learning rate: 1.720E-05 | global batch size: 16 | lm loss: 5.617855E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3281/ 128728 | consumed samples: 52496 | consumed tokens: 107511808 | elapsed time per iteration (s): 15.19 | learning rate: 1.720E-05 | global batch size: 16 | lm loss: 5.743747E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3282/ 128728 | consumed samples: 52512 | consumed tokens: 107544576 | elapsed time per iteration (s): 15.19 | learning rate: 1.721E-05 | global batch size: 16 | lm loss: 5.869383E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3283/ 128728 | consumed samples: 52528 | consumed tokens: 107577344 | elapsed time per iteration (s): 15.18 | learning rate: 1.721E-05 | global batch size: 16 | lm loss: 5.538039E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3284/ 128728 | consumed samples: 52544 | consumed tokens: 107610112 | elapsed time per iteration (s): 15.23 | learning rate: 1.722E-05 | global batch size: 16 | lm loss: 5.996184E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3285/ 128728 | consumed samples: 52560 | consumed tokens: 107642880 | elapsed time per iteration (s): 15.17 | learning rate: 1.722E-05 | global batch size: 16 | lm loss: 5.756711E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3286/ 128728 | consumed samples: 52576 | consumed tokens: 107675648 | elapsed time per iteration (s): 15.18 | learning rate: 1.723E-05 | global batch size: 16 | lm loss: 5.927887E+00 | grad norm: 1.021 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3287/ 128728 | consumed samples: 52592 | consumed tokens: 107708416 | elapsed time per iteration (s): 15.21 | learning rate: 1.723E-05 | global batch size: 16 | lm loss: 5.704397E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3288/ 128728 | consumed samples: 52608 | consumed tokens: 107741184 | elapsed time per iteration (s): 15.19 | learning rate: 1.724E-05 | 
global batch size: 16 | lm loss: 5.545193E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3289/ 128728 | consumed samples: 52624 | consumed tokens: 107773952 | elapsed time per iteration (s): 15.20 | learning rate: 1.724E-05 | global batch size: 16 | lm loss: 5.826765E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3290/ 128728 | consumed samples: 52640 | consumed tokens: 107806720 | elapsed time per iteration (s): 15.20 | learning rate: 1.725E-05 | global batch size: 16 | lm loss: 5.701634E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3291/ 128728 | consumed samples: 52656 | consumed tokens: 107839488 | elapsed time per iteration (s): 15.19 | learning rate: 1.725E-05 | global batch size: 16 | lm loss: 5.741204E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3292/ 128728 | consumed samples: 52672 | consumed tokens: 107872256 | elapsed time per iteration (s): 15.21 | learning rate: 1.726E-05 | global batch size: 16 | lm loss: 5.751829E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3293/ 128728 | consumed samples: 52688 | consumed tokens: 107905024 | elapsed time per iteration (s): 15.16 | learning rate: 1.726E-05 | global batch size: 16 | lm loss: 5.917647E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3294/ 128728 | consumed samples: 
52704 | consumed tokens: 107937792 | elapsed time per iteration (s): 15.25 | learning rate: 1.727E-05 | global batch size: 16 | lm loss: 5.593085E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3295/ 128728 | consumed samples: 52720 | consumed tokens: 107970560 | elapsed time per iteration (s): 15.19 | learning rate: 1.728E-05 | global batch size: 16 | lm loss: 5.778680E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3296/ 128728 | consumed samples: 52736 | consumed tokens: 108003328 | elapsed time per iteration (s): 15.23 | learning rate: 1.728E-05 | global batch size: 16 | lm loss: 5.817701E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3297/ 128728 | consumed samples: 52752 | consumed tokens: 108036096 | elapsed time per iteration (s): 15.23 | learning rate: 1.729E-05 | global batch size: 16 | lm loss: 5.779237E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3298/ 128728 | consumed samples: 52768 | consumed tokens: 108068864 | elapsed time per iteration (s): 15.23 | learning rate: 1.729E-05 | global batch size: 16 | lm loss: 5.603976E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3299/ 128728 | consumed samples: 52784 | consumed tokens: 108101632 | elapsed time per iteration (s): 15.23 | learning rate: 1.730E-05 | global batch size: 16 | lm loss: 5.524374E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3300/ 128728 | consumed samples: 52800 | consumed tokens: 108134400 | elapsed time per iteration (s): 15.19 | learning rate: 1.730E-05 | global batch size: 16 | lm loss: 5.887682E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3301/ 128728 | consumed samples: 52816 | consumed tokens: 108167168 | elapsed time per iteration (s): 15.21 | learning rate: 1.731E-05 | global batch size: 16 | lm loss: 5.713980E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3302/ 128728 | consumed samples: 52832 | consumed tokens: 108199936 | elapsed time per iteration (s): 15.22 | learning rate: 1.731E-05 | global batch size: 16 | lm loss: 5.805495E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3303/ 128728 | consumed samples: 52848 | consumed tokens: 108232704 | elapsed time per iteration (s): 15.22 | learning rate: 1.732E-05 | global batch size: 16 | lm loss: 5.778564E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3304/ 128728 | consumed samples: 52864 | consumed tokens: 108265472 | elapsed time per iteration (s): 15.19 | learning rate: 1.732E-05 | global batch size: 16 | lm loss: 5.578158E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3305/ 128728 | consumed samples: 52880 | consumed tokens: 108298240 | elapsed time per iteration (s): 15.23 | learning rate: 1.733E-05 | global batch size: 16 | lm loss: 5.771214E+00 
| grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3306/ 128728 | consumed samples: 52896 | consumed tokens: 108331008 | elapsed time per iteration (s): 15.20 | learning rate: 1.733E-05 | global batch size: 16 | lm loss: 5.839641E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3307/ 128728 | consumed samples: 52912 | consumed tokens: 108363776 | elapsed time per iteration (s): 15.21 | learning rate: 1.734E-05 | global batch size: 16 | lm loss: 5.654119E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3308/ 128728 | consumed samples: 52928 | consumed tokens: 108396544 | elapsed time per iteration (s): 15.18 | learning rate: 1.734E-05 | global batch size: 16 | lm loss: 5.570100E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3309/ 128728 | consumed samples: 52944 | consumed tokens: 108429312 | elapsed time per iteration (s): 15.18 | learning rate: 1.735E-05 | global batch size: 16 | lm loss: 5.901294E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3310/ 128728 | consumed samples: 52960 | consumed tokens: 108462080 | elapsed time per iteration (s): 15.22 | learning rate: 1.735E-05 | global batch size: 16 | lm loss: 5.962369E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3311/ 128728 | consumed samples: 52976 | consumed tokens: 108494848 | elapsed time 
per iteration (s): 15.20 | learning rate: 1.736E-05 | global batch size: 16 | lm loss: 5.723049E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3312/ 128728 | consumed samples: 52992 | consumed tokens: 108527616 | elapsed time per iteration (s): 15.27 | learning rate: 1.736E-05 | global batch size: 16 | lm loss: 5.923474E+00 | grad norm: 2.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3313/ 128728 | consumed samples: 53008 | consumed tokens: 108560384 | elapsed time per iteration (s): 15.22 | learning rate: 1.737E-05 | global batch size: 16 | lm loss: 5.845633E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3314/ 128728 | consumed samples: 53024 | consumed tokens: 108593152 | elapsed time per iteration (s): 15.23 | learning rate: 1.737E-05 | global batch size: 16 | lm loss: 5.823073E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3315/ 128728 | consumed samples: 53040 | consumed tokens: 108625920 | elapsed time per iteration (s): 15.22 | learning rate: 1.738E-05 | global batch size: 16 | lm loss: 5.727423E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3316/ 128728 | consumed samples: 53056 | consumed tokens: 108658688 | elapsed time per iteration (s): 15.22 | learning rate: 1.739E-05 | global batch size: 16 | lm loss: 5.628917E+00 | grad norm: 0.672 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | 
[default7]: iteration 3317/ 128728 | consumed samples: 53072 | consumed tokens: 108691456 | elapsed time per iteration (s): 15.21 | learning rate: 1.739E-05 | global batch size: 16 | lm loss: 5.650199E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3318/ 128728 | consumed samples: 53088 | consumed tokens: 108724224 | elapsed time per iteration (s): 15.21 | learning rate: 1.740E-05 | global batch size: 16 | lm loss: 5.799413E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3319/ 128728 | consumed samples: 53104 | consumed tokens: 108756992 | elapsed time per iteration (s): 15.23 | learning rate: 1.740E-05 | global batch size: 16 | lm loss: 5.819259E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3320/ 128728 | consumed samples: 53120 | consumed tokens: 108789760 | elapsed time per iteration (s): 15.16 | learning rate: 1.741E-05 | global batch size: 16 | lm loss: 5.719065E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3321/ 128728 | consumed samples: 53136 | consumed tokens: 108822528 | elapsed time per iteration (s): 15.21 | learning rate: 1.741E-05 | global batch size: 16 | lm loss: 5.814806E+00 | grad norm: 0.672 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3322/ 128728 | consumed samples: 53152 | consumed tokens: 108855296 | elapsed time per iteration (s): 15.23 | learning rate: 1.742E-05 | global batch size: 16 | lm loss: 5.729675E+00 | grad norm: 0.757 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3323/ 128728 | consumed samples: 53168 | consumed tokens: 108888064 | elapsed time per iteration (s): 15.21 | learning rate: 1.742E-05 | global batch size: 16 | lm loss: 5.674429E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3324/ 128728 | consumed samples: 53184 | consumed tokens: 108920832 | elapsed time per iteration (s): 15.20 | learning rate: 1.743E-05 | global batch size: 16 | lm loss: 5.645885E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3325/ 128728 | consumed samples: 53200 | consumed tokens: 108953600 | elapsed time per iteration (s): 15.23 | learning rate: 1.743E-05 | global batch size: 16 | lm loss: 5.516932E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3326/ 128728 | consumed samples: 53216 | consumed tokens: 108986368 | elapsed time per iteration (s): 15.15 | learning rate: 1.744E-05 | global batch size: 16 | lm loss: 5.534013E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3327/ 128728 | consumed samples: 53232 | consumed tokens: 109019136 | elapsed time per iteration (s): 15.17 | learning rate: 1.744E-05 | global batch size: 16 | lm loss: 5.667064E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3328/ 128728 | consumed samples: 53248 | consumed tokens: 109051904 | elapsed time per iteration (s): 15.21 | learning rate: 
1.745E-05 | global batch size: 16 | lm loss: 5.748591E+00 | grad norm: 1.036 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3329/ 128728 | consumed samples: 53264 | consumed tokens: 109084672 | elapsed time per iteration (s): 15.24 | learning rate: 1.745E-05 | global batch size: 16 | lm loss: 5.727609E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3330/ 128728 | consumed samples: 53280 | consumed tokens: 109117440 | elapsed time per iteration (s): 15.22 | learning rate: 1.746E-05 | global batch size: 16 | lm loss: 5.723650E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3331/ 128728 | consumed samples: 53296 | consumed tokens: 109150208 | elapsed time per iteration (s): 15.25 | learning rate: 1.746E-05 | global batch size: 16 | lm loss: 5.739835E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3332/ 128728 | consumed samples: 53312 | consumed tokens: 109182976 | elapsed time per iteration (s): 15.15 | learning rate: 1.747E-05 | global batch size: 16 | lm loss: 5.628811E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3333/ 128728 | consumed samples: 53328 | consumed tokens: 109215744 | elapsed time per iteration (s): 15.25 | learning rate: 1.747E-05 | global batch size: 16 | lm loss: 5.761261E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3334/ 128728 | consumed 
samples: 53344 | consumed tokens: 109248512 | elapsed time per iteration (s): 15.15 | learning rate: 1.748E-05 | global batch size: 16 | lm loss: 5.464535E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3335/ 128728 | consumed samples: 53360 | consumed tokens: 109281280 | elapsed time per iteration (s): 15.21 | learning rate: 1.749E-05 | global batch size: 16 | lm loss: 5.644732E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3336/ 128728 | consumed samples: 53376 | consumed tokens: 109314048 | elapsed time per iteration (s): 15.20 | learning rate: 1.749E-05 | global batch size: 16 | lm loss: 5.744635E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3337/ 128728 | consumed samples: 53392 | consumed tokens: 109346816 | elapsed time per iteration (s): 15.20 | learning rate: 1.750E-05 | global batch size: 16 | lm loss: 5.764827E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3338/ 128728 | consumed samples: 53408 | consumed tokens: 109379584 | elapsed time per iteration (s): 15.18 | learning rate: 1.750E-05 | global batch size: 16 | lm loss: 5.594451E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3339/ 128728 | consumed samples: 53424 | consumed tokens: 109412352 | elapsed time per iteration (s): 15.22 | learning rate: 1.751E-05 | global batch size: 16 | lm loss: 5.622928E+00 | grad norm: 1.031 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3340/ 128728 | consumed samples: 53440 | consumed tokens: 109445120 | elapsed time per iteration (s): 15.24 | learning rate: 1.751E-05 | global batch size: 16 | lm loss: 5.824643E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3341/ 128728 | consumed samples: 53456 | consumed tokens: 109477888 | elapsed time per iteration (s): 15.22 | learning rate: 1.752E-05 | global batch size: 16 | lm loss: 5.793392E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3342/ 128728 | consumed samples: 53472 | consumed tokens: 109510656 | elapsed time per iteration (s): 15.25 | learning rate: 1.752E-05 | global batch size: 16 | lm loss: 5.710301E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3343/ 128728 | consumed samples: 53488 | consumed tokens: 109543424 | elapsed time per iteration (s): 15.23 | learning rate: 1.753E-05 | global batch size: 16 | lm loss: 5.582598E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3344/ 128728 | consumed samples: 53504 | consumed tokens: 109576192 | elapsed time per iteration (s): 15.20 | learning rate: 1.753E-05 | global batch size: 16 | lm loss: 5.832360E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3345/ 128728 | consumed samples: 53520 | consumed tokens: 109608960 | elapsed time per iteration (s): 15.22 | learning rate: 1.754E-05 | global batch size: 16 | lm 
loss: 5.602098E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3346/ 128728 | consumed samples: 53536 | consumed tokens: 109641728 | elapsed time per iteration (s): 15.23 | learning rate: 1.754E-05 | global batch size: 16 | lm loss: 5.705314E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3347/ 128728 | consumed samples: 53552 | consumed tokens: 109674496 | elapsed time per iteration (s): 15.18 | learning rate: 1.755E-05 | global batch size: 16 | lm loss: 5.765421E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3348/ 128728 | consumed samples: 53568 | consumed tokens: 109707264 | elapsed time per iteration (s): 15.18 | learning rate: 1.755E-05 | global batch size: 16 | lm loss: 5.589844E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3349/ 128728 | consumed samples: 53584 | consumed tokens: 109740032 | elapsed time per iteration (s): 15.21 | learning rate: 1.756E-05 | global batch size: 16 | lm loss: 5.752171E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3350/ 128728 | consumed samples: 53600 | consumed tokens: 109772800 | elapsed time per iteration (s): 15.21 | learning rate: 1.756E-05 | global batch size: 16 | lm loss: 5.713757E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3351/ 128728 | consumed samples: 53616 | consumed tokens: 
109805568 | elapsed time per iteration (s): 15.17 | learning rate: 1.757E-05 | global batch size: 16 | lm loss: 5.712284E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3352/ 128728 | consumed samples: 53632 | consumed tokens: 109838336 | elapsed time per iteration (s): 15.20 | learning rate: 1.757E-05 | global batch size: 16 | lm loss: 5.660229E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3353/ 128728 | consumed samples: 53648 | consumed tokens: 109871104 | elapsed time per iteration (s): 15.19 | learning rate: 1.758E-05 | global batch size: 16 | lm loss: 5.759288E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3354/ 128728 | consumed samples: 53664 | consumed tokens: 109903872 | elapsed time per iteration (s): 15.18 | learning rate: 1.758E-05 | global batch size: 16 | lm loss: 5.624930E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3355/ 128728 | consumed samples: 53680 | consumed tokens: 109936640 | elapsed time per iteration (s): 15.22 | learning rate: 1.759E-05 | global batch size: 16 | lm loss: 5.804910E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3356/ 128728 | consumed samples: 53696 | consumed tokens: 109969408 | elapsed time per iteration (s): 15.20 | learning rate: 1.760E-05 | global batch size: 16 | lm loss: 5.792589E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.053 | TFLOPs: 8.06 | [default7]: iteration 3357/ 128728 | consumed samples: 53712 | consumed tokens: 110002176 | elapsed time per iteration (s): 15.20 | learning rate: 1.760E-05 | global batch size: 16 | lm loss: 5.710659E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3358/ 128728 | consumed samples: 53728 | consumed tokens: 110034944 | elapsed time per iteration (s): 15.23 | learning rate: 1.761E-05 | global batch size: 16 | lm loss: 5.681277E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3359/ 128728 | consumed samples: 53744 | consumed tokens: 110067712 | elapsed time per iteration (s): 15.22 | learning rate: 1.761E-05 | global batch size: 16 | lm loss: 5.616888E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3360/ 128728 | consumed samples: 53760 | consumed tokens: 110100480 | elapsed time per iteration (s): 15.21 | learning rate: 1.762E-05 | global batch size: 16 | lm loss: 5.545935E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3361/ 128728 | consumed samples: 53776 | consumed tokens: 110133248 | elapsed time per iteration (s): 15.22 | learning rate: 1.762E-05 | global batch size: 16 | lm loss: 5.594195E+00 | grad norm: 1.097 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3362/ 128728 | consumed samples: 53792 | consumed tokens: 110166016 | elapsed time per iteration (s): 15.20 | learning rate: 1.763E-05 | global batch size: 16 | lm loss: 5.793941E+00 | grad norm: 0.808 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3363/ 128728 | consumed samples: 53808 | consumed tokens: 110198784 | elapsed time per iteration (s): 15.19 | learning rate: 1.763E-05 | global batch size: 16 | lm loss: 5.692922E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3364/ 128728 | consumed samples: 53824 | consumed tokens: 110231552 | elapsed time per iteration (s): 15.23 | learning rate: 1.764E-05 | global batch size: 16 | lm loss: 5.684273E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3365/ 128728 | consumed samples: 53840 | consumed tokens: 110264320 | elapsed time per iteration (s): 15.23 | learning rate: 1.764E-05 | global batch size: 16 | lm loss: 5.695712E+00 | grad norm: 0.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3366/ 128728 | consumed samples: 53856 | consumed tokens: 110297088 | elapsed time per iteration (s): 15.17 | learning rate: 1.765E-05 | global batch size: 16 | lm loss: 5.798710E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3367/ 128728 | consumed samples: 53872 | consumed tokens: 110329856 | elapsed time per iteration (s): 15.21 | learning rate: 1.765E-05 | global batch size: 16 | lm loss: 5.708490E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3368/ 128728 | consumed samples: 53888 | consumed tokens: 110362624 | elapsed time per iteration (s): 
15.21 | learning rate: 1.766E-05 | global batch size: 16 | lm loss: 5.760231E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3369/ 128728 | consumed samples: 53904 | consumed tokens: 110395392 | elapsed time per iteration (s): 15.22 | learning rate: 1.766E-05 | global batch size: 16 | lm loss: 5.631289E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3370/ 128728 | consumed samples: 53920 | consumed tokens: 110428160 | elapsed time per iteration (s): 15.22 | learning rate: 1.767E-05 | global batch size: 16 | lm loss: 5.564578E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3371/ 128728 | consumed samples: 53936 | consumed tokens: 110460928 | elapsed time per iteration (s): 15.23 | learning rate: 1.767E-05 | global batch size: 16 | lm loss: 5.699044E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3372/ 128728 | consumed samples: 53952 | consumed tokens: 110493696 | elapsed time per iteration (s): 15.17 | learning rate: 1.768E-05 | global batch size: 16 | lm loss: 5.595973E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3373/ 128728 | consumed samples: 53968 | consumed tokens: 110526464 | elapsed time per iteration (s): 15.26 | learning rate: 1.768E-05 | global batch size: 16 | lm loss: 5.924860E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 
3374/ 128728 | consumed samples: 53984 | consumed tokens: 110559232 | elapsed time per iteration (s): 15.18 | learning rate: 1.769E-05 | global batch size: 16 | lm loss: 5.703710E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3375/ 128728 | consumed samples: 54000 | consumed tokens: 110592000 | elapsed time per iteration (s): 15.25 | learning rate: 1.769E-05 | global batch size: 16 | lm loss: 5.843274E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3376/ 128728 | consumed samples: 54016 | consumed tokens: 110624768 | elapsed time per iteration (s): 15.20 | learning rate: 1.770E-05 | global batch size: 16 | lm loss: 5.493551E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3377/ 128728 | consumed samples: 54032 | consumed tokens: 110657536 | elapsed time per iteration (s): 15.24 | learning rate: 1.771E-05 | global batch size: 16 | lm loss: 5.871907E+00 | grad norm: 1.331 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3378/ 128728 | consumed samples: 54048 | consumed tokens: 110690304 | elapsed time per iteration (s): 15.22 | learning rate: 1.771E-05 | global batch size: 16 | lm loss: 5.754053E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3379/ 128728 | consumed samples: 54064 | consumed tokens: 110723072 | elapsed time per iteration (s): 15.18 | learning rate: 1.772E-05 | global batch size: 16 | lm loss: 5.626816E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3380/ 128728 | consumed samples: 54080 | consumed tokens: 110755840 | elapsed time per iteration (s): 15.19 | learning rate: 1.772E-05 | global batch size: 16 | lm loss: 5.704596E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3381/ 128728 | consumed samples: 54096 | consumed tokens: 110788608 | elapsed time per iteration (s): 15.23 | learning rate: 1.773E-05 | global batch size: 16 | lm loss: 5.738787E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3382/ 128728 | consumed samples: 54112 | consumed tokens: 110821376 | elapsed time per iteration (s): 15.22 | learning rate: 1.773E-05 | global batch size: 16 | lm loss: 5.767883E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3383/ 128728 | consumed samples: 54128 | consumed tokens: 110854144 | elapsed time per iteration (s): 15.23 | learning rate: 1.774E-05 | global batch size: 16 | lm loss: 5.859027E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3384/ 128728 | consumed samples: 54144 | consumed tokens: 110886912 | elapsed time per iteration (s): 15.20 | learning rate: 1.774E-05 | global batch size: 16 | lm loss: 5.796133E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3385/ 128728 | consumed samples: 54160 | consumed tokens: 110919680 | elapsed time per iteration (s): 15.23 | learning rate: 1.775E-05 | 
global batch size: 16 | lm loss: 5.630734E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3386/ 128728 | consumed samples: 54176 | consumed tokens: 110952448 | elapsed time per iteration (s): 15.21 | learning rate: 1.775E-05 | global batch size: 16 | lm loss: 5.751515E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3387/ 128728 | consumed samples: 54192 | consumed tokens: 110985216 | elapsed time per iteration (s): 15.21 | learning rate: 1.776E-05 | global batch size: 16 | lm loss: 5.599256E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3388/ 128728 | consumed samples: 54208 | consumed tokens: 111017984 | elapsed time per iteration (s): 15.15 | learning rate: 1.776E-05 | global batch size: 16 | lm loss: 5.792551E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3389/ 128728 | consumed samples: 54224 | consumed tokens: 111050752 | elapsed time per iteration (s): 15.20 | learning rate: 1.777E-05 | global batch size: 16 | lm loss: 5.626520E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3390/ 128728 | consumed samples: 54240 | consumed tokens: 111083520 | elapsed time per iteration (s): 15.23 | learning rate: 1.777E-05 | global batch size: 16 | lm loss: 5.774959E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3391/ 128728 | consumed samples: 
54256 | consumed tokens: 111116288 | elapsed time per iteration (s): 15.21 | learning rate: 1.778E-05 | global batch size: 16 | lm loss: 5.683985E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3392/ 128728 | consumed samples: 54272 | consumed tokens: 111149056 | elapsed time per iteration (s): 15.17 | learning rate: 1.778E-05 | global batch size: 16 | lm loss: 5.707817E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3393/ 128728 | consumed samples: 54288 | consumed tokens: 111181824 | elapsed time per iteration (s): 15.21 | learning rate: 1.779E-05 | global batch size: 16 | lm loss: 5.814923E+00 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3394/ 128728 | consumed samples: 54304 | consumed tokens: 111214592 | elapsed time per iteration (s): 15.21 | learning rate: 1.779E-05 | global batch size: 16 | lm loss: 5.835570E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3395/ 128728 | consumed samples: 54320 | consumed tokens: 111247360 | elapsed time per iteration (s): 15.22 | learning rate: 1.780E-05 | global batch size: 16 | lm loss: 5.720476E+00 | grad norm: 1.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3396/ 128728 | consumed samples: 54336 | consumed tokens: 111280128 | elapsed time per iteration (s): 15.17 | learning rate: 1.780E-05 | global batch size: 16 | lm loss: 5.840722E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3397/ 128728 | consumed samples: 54352 | consumed tokens: 111312896 | elapsed time per iteration (s): 15.20 | learning rate: 1.781E-05 | global batch size: 16 | lm loss: 5.656087E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3398/ 128728 | consumed samples: 54368 | consumed tokens: 111345664 | elapsed time per iteration (s): 15.16 | learning rate: 1.782E-05 | global batch size: 16 | lm loss: 5.573381E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3399/ 128728 | consumed samples: 54384 | consumed tokens: 111378432 | elapsed time per iteration (s): 15.17 | learning rate: 1.782E-05 | global batch size: 16 | lm loss: 5.773726E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3400/ 128728 | consumed samples: 54400 | consumed tokens: 111411200 | elapsed time per iteration (s): 15.18 | learning rate: 1.783E-05 | global batch size: 16 | lm loss: 5.521105E+00 | grad norm: 0.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3401/ 128728 | consumed samples: 54416 | consumed tokens: 111443968 | elapsed time per iteration (s): 15.22 | learning rate: 1.783E-05 | global batch size: 16 | lm loss: 5.594294E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3402/ 128728 | consumed samples: 54432 | consumed tokens: 111476736 | elapsed time per iteration (s): 15.19 | learning rate: 1.784E-05 | global batch size: 16 | lm loss: 5.854078E+00 
| grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3403/ 128728 | consumed samples: 54448 | consumed tokens: 111509504 | elapsed time per iteration (s): 15.17 | learning rate: 1.784E-05 | global batch size: 16 | lm loss: 5.709444E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3404/ 128728 | consumed samples: 54464 | consumed tokens: 111542272 | elapsed time per iteration (s): 15.18 | learning rate: 1.785E-05 | global batch size: 16 | lm loss: 5.785772E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3405/ 128728 | consumed samples: 54480 | consumed tokens: 111575040 | elapsed time per iteration (s): 15.22 | learning rate: 1.785E-05 | global batch size: 16 | lm loss: 5.675919E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3406/ 128728 | consumed samples: 54496 | consumed tokens: 111607808 | elapsed time per iteration (s): 15.20 | learning rate: 1.786E-05 | global batch size: 16 | lm loss: 5.934880E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3407/ 128728 | consumed samples: 54512 | consumed tokens: 111640576 | elapsed time per iteration (s): 15.22 | learning rate: 1.786E-05 | global batch size: 16 | lm loss: 5.878328E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3408/ 128728 | consumed samples: 54528 | consumed tokens: 111673344 | elapsed time 
per iteration (s): 15.21 | learning rate: 1.787E-05 | global batch size: 16 | lm loss: 5.828094E+00 | grad norm: 0.873 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3409/ 128728 | consumed samples: 54544 | consumed tokens: 111706112 | elapsed time per iteration (s): 15.19 | learning rate: 1.787E-05 | global batch size: 16 | lm loss: 5.730283E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3410/ 128728 | consumed samples: 54560 | consumed tokens: 111738880 | elapsed time per iteration (s): 15.21 | learning rate: 1.788E-05 | global batch size: 16 | lm loss: 5.648894E+00 | grad norm: 1.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3411/ 128728 | consumed samples: 54576 | consumed tokens: 111771648 | elapsed time per iteration (s): 15.20 | learning rate: 1.788E-05 | global batch size: 16 | lm loss: 6.132384E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3412/ 128728 | consumed samples: 54592 | consumed tokens: 111804416 | elapsed time per iteration (s): 15.18 | learning rate: 1.789E-05 | global batch size: 16 | lm loss: 5.648220E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3413/ 128728 | consumed samples: 54608 | consumed tokens: 111837184 | elapsed time per iteration (s): 15.19 | learning rate: 1.789E-05 | global batch size: 16 | lm loss: 5.778464E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | 
[default7]: iteration 3414/ 128728 | consumed samples: 54624 | consumed tokens: 111869952 | elapsed time per iteration (s): 15.20 | learning rate: 1.790E-05 | global batch size: 16 | lm loss: 5.724689E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3415/ 128728 | consumed samples: 54640 | consumed tokens: 111902720 | elapsed time per iteration (s): 15.20 | learning rate: 1.790E-05 | global batch size: 16 | lm loss: 5.589879E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3416/ 128728 | consumed samples: 54656 | consumed tokens: 111935488 | elapsed time per iteration (s): 15.24 | learning rate: 1.791E-05 | global batch size: 16 | lm loss: 5.682995E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3417/ 128728 | consumed samples: 54672 | consumed tokens: 111968256 | elapsed time per iteration (s): 15.16 | learning rate: 1.791E-05 | global batch size: 16 | lm loss: 5.687815E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3418/ 128728 | consumed samples: 54688 | consumed tokens: 112001024 | elapsed time per iteration (s): 15.22 | learning rate: 1.792E-05 | global batch size: 16 | lm loss: 5.820484E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3419/ 128728 | consumed samples: 54704 | consumed tokens: 112033792 | elapsed time per iteration (s): 15.23 | learning rate: 1.793E-05 | global batch size: 16 | lm loss: 5.659999E+00 | grad norm: 0.698 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3420/ 128728 | consumed samples: 54720 | consumed tokens: 112066560 | elapsed time per iteration (s): 15.21 | learning rate: 1.793E-05 | global batch size: 16 | lm loss: 5.798374E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3421/ 128728 | consumed samples: 54736 | consumed tokens: 112099328 | elapsed time per iteration (s): 15.22 | learning rate: 1.794E-05 | global batch size: 16 | lm loss: 5.579554E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3422/ 128728 | consumed samples: 54752 | consumed tokens: 112132096 | elapsed time per iteration (s): 15.21 | learning rate: 1.794E-05 | global batch size: 16 | lm loss: 5.739928E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3423/ 128728 | consumed samples: 54768 | consumed tokens: 112164864 | elapsed time per iteration (s): 15.17 | learning rate: 1.795E-05 | global batch size: 16 | lm loss: 5.720255E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3424/ 128728 | consumed samples: 54784 | consumed tokens: 112197632 | elapsed time per iteration (s): 15.25 | learning rate: 1.795E-05 | global batch size: 16 | lm loss: 5.507630E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3425/ 128728 | consumed samples: 54800 | consumed tokens: 112230400 | elapsed time per iteration (s): 15.19 | learning rate: 
1.796E-05 | global batch size: 16 | lm loss: 5.621741E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3426/ 128728 | consumed samples: 54816 | consumed tokens: 112263168 | elapsed time per iteration (s): 15.20 | learning rate: 1.796E-05 | global batch size: 16 | lm loss: 5.538146E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3427/ 128728 | consumed samples: 54832 | consumed tokens: 112295936 | elapsed time per iteration (s): 15.21 | learning rate: 1.797E-05 | global batch size: 16 | lm loss: 5.712105E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3428/ 128728 | consumed samples: 54848 | consumed tokens: 112328704 | elapsed time per iteration (s): 15.23 | learning rate: 1.797E-05 | global batch size: 16 | lm loss: 5.487374E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3429/ 128728 | consumed samples: 54864 | consumed tokens: 112361472 | elapsed time per iteration (s): 15.19 | learning rate: 1.798E-05 | global batch size: 16 | lm loss: 5.644139E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3430/ 128728 | consumed samples: 54880 | consumed tokens: 112394240 | elapsed time per iteration (s): 15.17 | learning rate: 1.798E-05 | global batch size: 16 | lm loss: 5.514249E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3431/ 128728 | consumed 
samples: 54896 | consumed tokens: 112427008 | elapsed time per iteration (s): 15.18 | learning rate: 1.799E-05 | global batch size: 16 | lm loss: 5.665630E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3432/ 128728 | consumed samples: 54912 | consumed tokens: 112459776 | elapsed time per iteration (s): 15.15 | learning rate: 1.799E-05 | global batch size: 16 | lm loss: 5.801665E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3433/ 128728 | consumed samples: 54928 | consumed tokens: 112492544 | elapsed time per iteration (s): 15.15 | learning rate: 1.800E-05 | global batch size: 16 | lm loss: 5.669302E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3434/ 128728 | consumed samples: 54944 | consumed tokens: 112525312 | elapsed time per iteration (s): 15.18 | learning rate: 1.800E-05 | global batch size: 16 | lm loss: 5.777668E+00 | grad norm: 0.964 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3435/ 128728 | consumed samples: 54960 | consumed tokens: 112558080 | elapsed time per iteration (s): 15.13 | learning rate: 1.801E-05 | global batch size: 16 | lm loss: 5.705936E+00 | grad norm: 0.943 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.058 | TFLOPs: 8.10 | [default7]: iteration 3436/ 128728 | consumed samples: 54976 | consumed tokens: 112590848 | elapsed time per iteration (s): 15.20 | learning rate: 1.801E-05 | global batch size: 16 | lm loss: 5.854589E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3437/ 128728 | consumed samples: 54992 | consumed tokens: 112623616 | elapsed time per iteration (s): 15.22 | learning rate: 1.802E-05 | global batch size: 16 | lm loss: 5.623005E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3438/ 128728 | consumed samples: 55008 | consumed tokens: 112656384 | elapsed time per iteration (s): 15.21 | learning rate: 1.803E-05 | global batch size: 16 | lm loss: 5.733920E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3439/ 128728 | consumed samples: 55024 | consumed tokens: 112689152 | elapsed time per iteration (s): 15.14 | learning rate: 1.803E-05 | global batch size: 16 | lm loss: 5.607145E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3440/ 128728 | consumed samples: 55040 | consumed tokens: 112721920 | elapsed time per iteration (s): 15.18 | learning rate: 1.804E-05 | global batch size: 16 | lm loss: 5.568397E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3441/ 128728 | consumed samples: 55056 | consumed tokens: 112754688 | elapsed time per iteration (s): 15.22 | learning rate: 1.804E-05 | global batch size: 16 | lm loss: 5.497924E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3442/ 128728 | consumed samples: 55072 | consumed tokens: 112787456 | elapsed time per iteration (s): 15.20 | learning rate: 1.805E-05 | global batch size: 16 | lm 
loss: 5.711787E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3443/ 128728 | consumed samples: 55088 | consumed tokens: 112820224 | elapsed time per iteration (s): 15.19 | learning rate: 1.805E-05 | global batch size: 16 | lm loss: 5.645088E+00 | grad norm: 0.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3444/ 128728 | consumed samples: 55104 | consumed tokens: 112852992 | elapsed time per iteration (s): 15.18 | learning rate: 1.806E-05 | global batch size: 16 | lm loss: 5.776569E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3445/ 128728 | consumed samples: 55120 | consumed tokens: 112885760 | elapsed time per iteration (s): 15.16 | learning rate: 1.806E-05 | global batch size: 16 | lm loss: 5.663031E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3446/ 128728 | consumed samples: 55136 | consumed tokens: 112918528 | elapsed time per iteration (s): 15.22 | learning rate: 1.807E-05 | global batch size: 16 | lm loss: 5.596757E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3447/ 128728 | consumed samples: 55152 | consumed tokens: 112951296 | elapsed time per iteration (s): 15.15 | learning rate: 1.807E-05 | global batch size: 16 | lm loss: 5.633924E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3448/ 128728 | consumed samples: 55168 | consumed tokens: 
112984064 | elapsed time per iteration (s): 15.22 | learning rate: 1.808E-05 | global batch size: 16 | lm loss: 5.418813E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3449/ 128728 | consumed samples: 55184 | consumed tokens: 113016832 | elapsed time per iteration (s): 15.20 | learning rate: 1.808E-05 | global batch size: 16 | lm loss: 5.588249E+00 | grad norm: 1.016 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3450/ 128728 | consumed samples: 55200 | consumed tokens: 113049600 | elapsed time per iteration (s): 15.19 | learning rate: 1.809E-05 | global batch size: 16 | lm loss: 5.400003E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3451/ 128728 | consumed samples: 55216 | consumed tokens: 113082368 | elapsed time per iteration (s): 15.22 | learning rate: 1.809E-05 | global batch size: 16 | lm loss: 5.908926E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3452/ 128728 | consumed samples: 55232 | consumed tokens: 113115136 | elapsed time per iteration (s): 15.25 | learning rate: 1.810E-05 | global batch size: 16 | lm loss: 5.507290E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3453/ 128728 | consumed samples: 55248 | consumed tokens: 113147904 | elapsed time per iteration (s): 15.19 | learning rate: 1.810E-05 | global batch size: 16 | lm loss: 5.697307E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.053 | TFLOPs: 8.06 | [default7]: iteration 3454/ 128728 | consumed samples: 55264 | consumed tokens: 113180672 | elapsed time per iteration (s): 15.22 | learning rate: 1.811E-05 | global batch size: 16 | lm loss: 5.761248E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3455/ 128728 | consumed samples: 55280 | consumed tokens: 113213440 | elapsed time per iteration (s): 15.22 | learning rate: 1.811E-05 | global batch size: 16 | lm loss: 5.412930E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3456/ 128728 | consumed samples: 55296 | consumed tokens: 113246208 | elapsed time per iteration (s): 15.21 | learning rate: 1.812E-05 | global batch size: 16 | lm loss: 5.534837E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3457/ 128728 | consumed samples: 55312 | consumed tokens: 113278976 | elapsed time per iteration (s): 15.23 | learning rate: 1.812E-05 | global batch size: 16 | lm loss: 5.676351E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3458/ 128728 | consumed samples: 55328 | consumed tokens: 113311744 | elapsed time per iteration (s): 15.21 | learning rate: 1.813E-05 | global batch size: 16 | lm loss: 5.914691E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3459/ 128728 | consumed samples: 55344 | consumed tokens: 113344512 | elapsed time per iteration (s): 15.23 | learning rate: 1.814E-05 | global batch size: 16 | lm loss: 5.779829E+00 | grad norm: 0.750 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3460/ 128728 | consumed samples: 55360 | consumed tokens: 113377280 | elapsed time per iteration (s): 15.21 | learning rate: 1.814E-05 | global batch size: 16 | lm loss: 5.488255E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3461/ 128728 | consumed samples: 55376 | consumed tokens: 113410048 | elapsed time per iteration (s): 15.23 | learning rate: 1.815E-05 | global batch size: 16 | lm loss: 5.597379E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3462/ 128728 | consumed samples: 55392 | consumed tokens: 113442816 | elapsed time per iteration (s): 15.26 | learning rate: 1.815E-05 | global batch size: 16 | lm loss: 5.796825E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3463/ 128728 | consumed samples: 55408 | consumed tokens: 113475584 | elapsed time per iteration (s): 15.15 | learning rate: 1.816E-05 | global batch size: 16 | lm loss: 5.453174E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3464/ 128728 | consumed samples: 55424 | consumed tokens: 113508352 | elapsed time per iteration (s): 15.20 | learning rate: 1.816E-05 | global batch size: 16 | lm loss: 5.592092E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3465/ 128728 | consumed samples: 55440 | consumed tokens: 113541120 | elapsed time per iteration (s): 
15.20 | learning rate: 1.817E-05 | global batch size: 16 | lm loss: 5.629677E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3466/ 128728 | consumed samples: 55456 | consumed tokens: 113573888 | elapsed time per iteration (s): 15.24 | learning rate: 1.817E-05 | global batch size: 16 | lm loss: 5.776768E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3467/ 128728 | consumed samples: 55472 | consumed tokens: 113606656 | elapsed time per iteration (s): 15.17 | learning rate: 1.818E-05 | global batch size: 16 | lm loss: 5.656150E+00 | grad norm: 0.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3468/ 128728 | consumed samples: 55488 | consumed tokens: 113639424 | elapsed time per iteration (s): 15.15 | learning rate: 1.818E-05 | global batch size: 16 | lm loss: 5.554830E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3469/ 128728 | consumed samples: 55504 | consumed tokens: 113672192 | elapsed time per iteration (s): 15.22 | learning rate: 1.819E-05 | global batch size: 16 | lm loss: 5.850750E+00 | grad norm: 0.862 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3470/ 128728 | consumed samples: 55520 | consumed tokens: 113704960 | elapsed time per iteration (s): 15.21 | learning rate: 1.819E-05 | global batch size: 16 | lm loss: 5.848739E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 
3471/ 128728 | consumed samples: 55536 | consumed tokens: 113737728 | elapsed time per iteration (s): 15.14 | learning rate: 1.820E-05 | global batch size: 16 | lm loss: 5.411209E+00 | grad norm: 0.897 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3472/ 128728 | consumed samples: 55552 | consumed tokens: 113770496 | elapsed time per iteration (s): 15.24 | learning rate: 1.820E-05 | global batch size: 16 | lm loss: 5.765627E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3473/ 128728 | consumed samples: 55568 | consumed tokens: 113803264 | elapsed time per iteration (s): 15.21 | learning rate: 1.821E-05 | global batch size: 16 | lm loss: 5.575092E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3474/ 128728 | consumed samples: 55584 | consumed tokens: 113836032 | elapsed time per iteration (s): 15.21 | learning rate: 1.821E-05 | global batch size: 16 | lm loss: 5.591868E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3475/ 128728 | consumed samples: 55600 | consumed tokens: 113868800 | elapsed time per iteration (s): 15.21 | learning rate: 1.822E-05 | global batch size: 16 | lm loss: 5.551509E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3476/ 128728 | consumed samples: 55616 | consumed tokens: 113901568 | elapsed time per iteration (s): 15.22 | learning rate: 1.822E-05 | global batch size: 16 | lm loss: 5.394422E+00 | grad norm: 0.999 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3477/ 128728 | consumed samples: 55632 | consumed tokens: 113934336 | elapsed time per iteration (s): 15.22 | learning rate: 1.823E-05 | global batch size: 16 | lm loss: 5.498854E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3478/ 128728 | consumed samples: 55648 | consumed tokens: 113967104 | elapsed time per iteration (s): 15.20 | learning rate: 1.823E-05 | global batch size: 16 | lm loss: 5.861041E+00 | grad norm: 1.413 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3479/ 128728 | consumed samples: 55664 | consumed tokens: 113999872 | elapsed time per iteration (s): 15.17 | learning rate: 1.824E-05 | global batch size: 16 | lm loss: 5.653027E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3480/ 128728 | consumed samples: 55680 | consumed tokens: 114032640 | elapsed time per iteration (s): 15.22 | learning rate: 1.825E-05 | global batch size: 16 | lm loss: 5.562919E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3481/ 128728 | consumed samples: 55696 | consumed tokens: 114065408 | elapsed time per iteration (s): 15.21 | learning rate: 1.825E-05 | global batch size: 16 | lm loss: 5.663836E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3482/ 128728 | consumed samples: 55712 | consumed tokens: 114098176 | elapsed time per iteration (s): 15.23 | learning rate: 1.826E-05 | 
global batch size: 16 | lm loss: 5.682405E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3483/ 128728 | consumed samples: 55728 | consumed tokens: 114130944 | elapsed time per iteration (s): 15.22 | learning rate: 1.826E-05 | global batch size: 16 | lm loss: 5.507264E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3484/ 128728 | consumed samples: 55744 | consumed tokens: 114163712 | elapsed time per iteration (s): 15.18 | learning rate: 1.827E-05 | global batch size: 16 | lm loss: 5.668527E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3485/ 128728 | consumed samples: 55760 | consumed tokens: 114196480 | elapsed time per iteration (s): 15.22 | learning rate: 1.827E-05 | global batch size: 16 | lm loss: 5.564321E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3486/ 128728 | consumed samples: 55776 | consumed tokens: 114229248 | elapsed time per iteration (s): 15.24 | learning rate: 1.828E-05 | global batch size: 16 | lm loss: 5.737549E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3487/ 128728 | consumed samples: 55792 | consumed tokens: 114262016 | elapsed time per iteration (s): 15.24 | learning rate: 1.828E-05 | global batch size: 16 | lm loss: 5.537987E+00 | grad norm: 1.367 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3488/ 128728 | consumed samples: 
55808 | consumed tokens: 114294784 | elapsed time per iteration (s): 15.22 | learning rate: 1.829E-05 | global batch size: 16 | lm loss: 5.651535E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3489/ 128728 | consumed samples: 55824 | consumed tokens: 114327552 | elapsed time per iteration (s): 15.21 | learning rate: 1.829E-05 | global batch size: 16 | lm loss: 5.642838E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3490/ 128728 | consumed samples: 55840 | consumed tokens: 114360320 | elapsed time per iteration (s): 15.19 | learning rate: 1.830E-05 | global batch size: 16 | lm loss: 5.894348E+00 | grad norm: 1.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3491/ 128728 | consumed samples: 55856 | consumed tokens: 114393088 | elapsed time per iteration (s): 15.24 | learning rate: 1.830E-05 | global batch size: 16 | lm loss: 5.590985E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3492/ 128728 | consumed samples: 55872 | consumed tokens: 114425856 | elapsed time per iteration (s): 15.21 | learning rate: 1.831E-05 | global batch size: 16 | lm loss: 5.752702E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3493/ 128728 | consumed samples: 55888 | consumed tokens: 114458624 | elapsed time per iteration (s): 15.18 | learning rate: 1.831E-05 | global batch size: 16 | lm loss: 5.723320E+00 | grad norm: 0.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3494/ 128728 | consumed samples: 55904 | consumed tokens: 114491392 | elapsed time per iteration (s): 15.21 | learning rate: 1.832E-05 | global batch size: 16 | lm loss: 5.537277E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3495/ 128728 | consumed samples: 55920 | consumed tokens: 114524160 | elapsed time per iteration (s): 15.23 | learning rate: 1.832E-05 | global batch size: 16 | lm loss: 5.881509E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3496/ 128728 | consumed samples: 55936 | consumed tokens: 114556928 | elapsed time per iteration (s): 15.22 | learning rate: 1.833E-05 | global batch size: 16 | lm loss: 5.464675E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3497/ 128728 | consumed samples: 55952 | consumed tokens: 114589696 | elapsed time per iteration (s): 15.16 | learning rate: 1.833E-05 | global batch size: 16 | lm loss: 5.424148E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3498/ 128728 | consumed samples: 55968 | consumed tokens: 114622464 | elapsed time per iteration (s): 15.20 | learning rate: 1.834E-05 | global batch size: 16 | lm loss: 5.588657E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3499/ 128728 | consumed samples: 55984 | consumed tokens: 114655232 | elapsed time per iteration (s): 15.17 | learning rate: 1.834E-05 | global batch size: 16 | lm loss: 5.764312E+00 
| grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3500/ 128728 | consumed samples: 56000 | consumed tokens: 114688000 | elapsed time per iteration (s): 15.19 | learning rate: 1.835E-05 | global batch size: 16 | lm loss: 5.390745E+00 | grad norm: 0.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default0]:saving checkpoint at iteration 3500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 20:47:38,235] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/mp_rank_00_model_states.pt [default1]:[2022-03-03 20:47:38,515] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/mp_rank_01_model_states.pt [default0]:[2022-03-03 20:47:52,222] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default5]:[2022-03-03 20:47:52,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default4]:[2022-03-03 20:47:52,450] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default4]:[2022-03-03 20:47:52,448] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt 
[default1]:[2022-03-03 20:47:52,517] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default6]:[2022-03-03 20:47:52,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default5]:[2022-03-03 20:47:52,805] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default2]:[2022-03-03 20:47:52,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default0]:[2022-03-03 20:47:52,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default3]:[2022-03-03 20:47:53,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default7]:[2022-03-03 20:47:53,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default7]:[2022-03-03 20:47:53,089] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default5]:[2022-03-03 20:47:53,226] [INFO] [engine.py:3077:_save_zero_checkpoint] 
bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default2]:[2022-03-03 20:47:53,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default7]:[2022-03-03 20:47:53,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default0]:[2022-03-03 20:47:53,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default0]:[2022-03-03 20:47:53,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default6]:[2022-03-03 20:47:53,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default2]:[2022-03-03 20:47:53,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default5]:[2022-03-03 20:47:53,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default3]:[2022-03-03 20:47:53,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default3]:[2022-03-03 20:47:53,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default6]:[2022-03-03 20:47:53,513] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default2]:[2022-03-03 20:47:53,555] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default1]:[2022-03-03 20:47:53,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default1]:[2022-03-03 20:47:53,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default1]:[2022-03-03 20:47:53,779] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default1]:[2022-03-03 20:47:53,875] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default3]:[2022-03-03 20:47:53,946] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default5]:[2022-03-03 20:47:54,075] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default0]:[2022-03-03 20:47:54,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default4]:[2022-03-03 20:47:54,112] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default2]:[2022-03-03 20:47:54,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default4]:[2022-03-03 20:47:54,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default0]:[2022-03-03 20:47:54,233] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default1]:[2022-03-03 20:47:54,280] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default5]:[2022-03-03 20:47:54,211] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default6]:[2022-03-03 20:47:54,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default7]:[2022-03-03 20:47:54,296] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default3]:[2022-03-03 20:47:54,376] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default2]:[2022-03-03 20:47:54,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default4]:[2022-03-03 20:47:54,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default0]:[2022-03-03 20:47:54,378] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default7]:[2022-03-03 20:47:54,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default1]:[2022-03-03 20:47:54,387] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default6]:[2022-03-03 20:47:54,372] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default4]:[2022-03-03 20:47:54,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default7]:[2022-03-03 20:47:54,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default5]:[2022-03-03 20:47:54,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default0]:[2022-03-03 20:47:54,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default4]:[2022-03-03 20:47:54,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default7]:[2022-03-03 20:47:54,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default5]:[2022-03-03 20:47:54,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default6]:[2022-03-03 20:47:54,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default3]:[2022-03-03 20:47:54,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default4]:[2022-03-03 20:47:54,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default3]:[2022-03-03 20:47:54,810] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default2]:[2022-03-03 20:47:54,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default2]:[2022-03-03 20:47:54,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default3]:[2022-03-03 20:47:54,899] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default0]:[2022-03-03 20:47:55,048] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default6]:[2022-03-03 20:47:55,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default6]:[2022-03-03 20:47:55,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default5]:[2022-03-03 20:47:55,185] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default1]:[2022-03-03 20:47:55,214] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default2]:[2022-03-03 20:47:55,188] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default4]:[2022-03-03 20:47:55,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default5]:[2022-03-03 20:47:55,304] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default7]:[2022-03-03 20:47:55,305] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default2]:[2022-03-03 20:47:55,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default2]:[2022-03-03 20:47:55,472] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default1]:[2022-03-03 20:47:55,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default6]:[2022-03-03 20:47:55,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default4]:[2022-03-03 20:47:55,653] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default7]:[2022-03-03 20:47:55,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default3]:[2022-03-03 20:47:55,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default5]:[2022-03-03 20:47:55,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default6]:[2022-03-03 20:47:55,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default0]:[2022-03-03 20:47:55,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default3]:[2022-03-03 20:47:56,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default2]:[2022-03-03 20:47:56,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default1]:[2022-03-03 20:47:56,097] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default0]:[2022-03-03 20:47:56,117] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default4]:[2022-03-03 20:47:56,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default3]:[2022-03-03 20:47:56,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default1]:[2022-03-03 20:47:56,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default6]:[2022-03-03 20:47:56,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default0]:[2022-03-03 20:47:56,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default5]:[2022-03-03 20:47:56,385] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default2]:[2022-03-03 20:47:56,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default7]:[2022-03-03 20:47:56,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default5]:[2022-03-03 20:47:56,578] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default5]:[2022-03-03 20:47:56,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default6]:[2022-03-03 20:47:56,855] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default4]:[2022-03-03 20:47:56,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default1]:[2022-03-03 20:47:57,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default2]:[2022-03-03 20:47:57,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default7]:[2022-03-03 20:47:57,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default3]:[2022-03-03 20:47:57,159] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default6]:[2022-03-03 20:47:57,190] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default1]:[2022-03-03 20:47:57,219] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default7]:[2022-03-03 20:47:57,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default4]:[2022-03-03 20:47:57,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default1]:[2022-03-03 20:47:57,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default0]:[2022-03-03 20:47:57,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default3]:[2022-03-03 20:47:57,365] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default5]:[2022-03-03 20:47:57,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default2]:[2022-03-03 20:47:57,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default4]:[2022-03-03 20:47:57,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default7]:[2022-03-03 20:47:57,526] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default4]:[2022-03-03 20:47:57,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default6]:[2022-03-03 20:47:57,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default3]:[2022-03-03 20:47:57,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default1]:[2022-03-03 20:47:57,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default2]:[2022-03-03 20:47:57,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default3]:[2022-03-03 20:47:57,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default4]:[2022-03-03 20:47:57,774] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default7]:[2022-03-03 20:47:57,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default6]:[2022-03-03 20:47:57,900] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default7]:[2022-03-03 20:47:57,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default4]:[2022-03-03 20:47:57,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default2]:[2022-03-03 20:47:58,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default0]:[2022-03-03 20:47:58,056] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default0]:[2022-03-03 20:47:58,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default5]:[2022-03-03 20:47:58,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default1]:[2022-03-03 20:47:58,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default3]:[2022-03-03 20:47:58,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default1]:[2022-03-03 20:47:58,219] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default3]:[2022-03-03 20:47:58,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default2]:[2022-03-03 20:47:58,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default0]:[2022-03-03 20:47:58,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default3]:[2022-03-03 20:47:58,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default7]:[2022-03-03 20:47:58,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default1]:[2022-03-03 20:47:58,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default0]:[2022-03-03 20:47:58,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default6]:[2022-03-03 20:47:58,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default4]:[2022-03-03 20:47:58,415] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default0]:[2022-03-03 20:47:58,429] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default2]:[2022-03-03 20:47:58,495] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default1]:[2022-03-03 20:47:58,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default1]:[2022-03-03 20:47:58,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default4]:[2022-03-03 20:47:58,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default4]:[2022-03-03 20:47:58,407] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default5]:[2022-03-03 20:47:58,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default5]:[2022-03-03 20:47:58,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default5]:[2022-03-03 20:47:58,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default7]:[2022-03-03 20:47:58,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default7]:[2022-03-03 20:47:58,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default0]:[2022-03-03 20:47:58,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default6]:[2022-03-03 20:47:58,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default7]:[2022-03-03 20:47:58,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default0]:[2022-03-03 20:47:58,711] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default7]:[2022-03-03 20:47:58,821] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default6]:[2022-03-03 20:47:58,818] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default5]:[2022-03-03 20:47:58,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default6]:[2022-03-03 20:47:58,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default7]:[2022-03-03 20:47:59,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default0]:[2022-03-03 20:47:59,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default7]:[2022-03-03 20:47:58,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default6]:[2022-03-03 20:47:58,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default7]:[2022-03-03 20:47:59,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default2]:[2022-03-03 20:47:59,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default7]:[2022-03-03 20:47:59,098] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default6]:[2022-03-03 20:47:59,073] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default0]:[2022-03-03 20:47:59,063] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default4]:[2022-03-03 20:47:59,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default4]:[2022-03-03 20:47:59,194] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default6]:[2022-03-03 20:47:59,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default5]:[2022-03-03 20:47:59,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default3]:[2022-03-03 20:47:59,156] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default3]:[2022-03-03 20:47:59,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default5]:[2022-03-03 20:47:59,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default6]:[2022-03-03 20:47:59,405] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default7]:[2022-03-03 20:47:59,337] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default1]:[2022-03-03 20:47:59,386] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default0]:[2022-03-03 20:47:59,362] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default3]:[2022-03-03 20:47:59,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default0]:[2022-03-03 20:47:59,413] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default4]:[2022-03-03 20:47:59,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default0]:[2022-03-03 20:47:59,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default2]:[2022-03-03 20:47:59,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default6]:[2022-03-03 20:47:59,560] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default4]:[2022-03-03 20:47:59,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default5]:[2022-03-03 20:47:59,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default2]:[2022-03-03 20:47:59,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default4]:[2022-03-03 20:47:59,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default1]:[2022-03-03 20:47:59,613] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default0]:[2022-03-03 20:47:59,690] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default4]:[2022-03-03 20:47:59,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default1]:[2022-03-03 20:47:59,611] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default0]:[2022-03-03 20:47:59,786] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default1]:[2022-03-03 20:47:59,778] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default3]:[2022-03-03 20:47:59,802] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default6]:[2022-03-03 20:47:59,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default2]:[2022-03-03 20:47:59,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default3]:[2022-03-03 20:47:59,795] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default2]:[2022-03-03 20:47:59,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default6]:[2022-03-03 20:47:59,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default5]:[2022-03-03 20:47:59,895] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default3]:[2022-03-03 20:47:59,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default2]:[2022-03-03 20:47:59,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default0]:[2022-03-03 20:47:59,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default1]:[2022-03-03 20:48:00,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default1]:[2022-03-03 20:48:00,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default3]:[2022-03-03 20:48:00,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default3]:[2022-03-03 20:48:00,047] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default0]:[2022-03-03 20:48:00,124] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default6]:[2022-03-03 20:48:00,103] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default0]:[2022-03-03 20:48:00,110] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default1]:[2022-03-03 20:48:00,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default0]:[2022-03-03 20:48:00,104] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default1]:[2022-03-03 20:48:00,234] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default3]:[2022-03-03 20:48:00,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default2]:[2022-03-03 20:48:00,309] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default4]:[2022-03-03 20:48:00,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default5]:[2022-03-03 20:48:00,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default2]:[2022-03-03 20:48:00,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default2]:[2022-03-03 20:48:00,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default3]:[2022-03-03 20:48:00,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default7]:[2022-03-03 20:48:00,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default3]:[2022-03-03 20:48:00,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default6]:[2022-03-03 20:48:00,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default1]:[2022-03-03 20:48:00,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default2]:[2022-03-03 20:48:00,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default5]:[2022-03-03 20:48:00,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default4]:[2022-03-03 20:48:00,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default5]:[2022-03-03 20:48:00,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default3]:[2022-03-03 20:48:00,705] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default2]:[2022-03-03 20:48:00,634] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default7]:[2022-03-03 20:48:00,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default4]:[2022-03-03 20:48:00,744] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default1]:[2022-03-03 20:48:00,759] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default0]:[2022-03-03 20:48:00,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default3]:[2022-03-03 20:48:00,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default4]:[2022-03-03 20:48:00,859] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default5]:[2022-03-03 20:48:00,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default5]:[2022-03-03 20:48:00,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default5]:[2022-03-03 20:48:00,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default1]:[2022-03-03 20:48:00,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default5]:[2022-03-03 20:48:01,037] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default0]:[2022-03-03 20:48:00,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default4]:[2022-03-03 20:48:00,988] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default4]:[2022-03-03 20:48:01,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default2]:[2022-03-03 20:48:01,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default4]:[2022-03-03 20:48:01,177] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default7]:[2022-03-03 20:48:01,105] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default5]:[2022-03-03 20:48:01,182] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default5]:[2022-03-03 20:48:01,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default1]:[2022-03-03 20:48:01,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default3]:[2022-03-03 20:48:01,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default4]:[2022-03-03 20:48:01,328] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default2]:[2022-03-03 20:48:01,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default0]:[2022-03-03 20:48:01,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default3]:[2022-03-03 20:48:01,467] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default7]:[2022-03-03 20:48:01,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default6]:[2022-03-03 20:48:01,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default7]:[2022-03-03 20:48:01,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default0]:[2022-03-03 20:48:01,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default5]:[2022-03-03 20:48:01,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default6]:[2022-03-03 20:48:01,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default2]:[2022-03-03 20:48:01,641] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default3]:[2022-03-03 20:48:01,579] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default1]:[2022-03-03 20:48:01,624] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default4]:[2022-03-03 20:48:01,664] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default6]:[2022-03-03 20:48:01,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default6]:[2022-03-03 20:48:01,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default6]:[2022-03-03 20:48:01,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default5]:[2022-03-03 20:48:01,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default4]:[2022-03-03 20:48:01,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default1]:[2022-03-03 20:48:01,713] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default7]:[2022-03-03 20:48:01,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default7]:[2022-03-03 20:48:01,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default3]:[2022-03-03 20:48:01,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default1]:[2022-03-03 20:48:01,966] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default2]:[2022-03-03 20:48:02,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default7]:[2022-03-03 20:48:02,037] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default6]:[2022-03-03 20:48:01,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default7]:[2022-03-03 20:48:02,139] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default5]:[2022-03-03 20:48:02,150] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default2]:[2022-03-03 20:48:02,155] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default0]:[2022-03-03 20:48:02,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default0]:[2022-03-03 20:48:02,270] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default1]:[2022-03-03 20:48:02,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default1]:[2022-03-03 20:48:02,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default3]:[2022-03-03 20:48:02,403] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default6]:[2022-03-03 20:48:02,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default2]:[2022-03-03 20:48:02,614] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default7]:[2022-03-03 20:48:02,657] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default7]:[2022-03-03 20:48:02,625] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default3]:[2022-03-03 20:48:02,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default1]:[2022-03-03 20:48:02,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default1]:[2022-03-03 20:48:02,870] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default0]:[2022-03-03 20:48:02,873] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default2]:[2022-03-03 20:48:02,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default5]:[2022-03-03 20:48:03,024] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default2]:[2022-03-03 20:48:03,081] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default1]:[2022-03-03 20:48:03,131] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default4]:[2022-03-03 20:48:03,122] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default3]:[2022-03-03 20:48:03,179] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default0]:[2022-03-03 20:48:03,091] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default7]:[2022-03-03 20:48:03,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default3]:[2022-03-03 20:48:03,400] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default6]:[2022-03-03 20:48:03,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default4]:[2022-03-03 20:48:03,486] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default6]:[2022-03-03 20:48:03,492] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default3]:[2022-03-03 20:48:03,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default1]:[2022-03-03 20:48:03,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default0]:[2022-03-03 20:48:03,512] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default2]:[2022-03-03 20:48:03,587] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default6]:[2022-03-03 20:48:03,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default4]:[2022-03-03 20:48:03,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default6]:[2022-03-03 20:48:03,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default7]:[2022-03-03 20:48:03,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default1]:[2022-03-03 20:48:03,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default0]:[2022-03-03 20:48:03,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default1]:[2022-03-03 20:48:03,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default3]:[2022-03-03 20:48:03,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default7]:[2022-03-03 20:48:03,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default7]:[2022-03-03 20:48:03,833] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default2]:[2022-03-03 20:48:03,789] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default0]:[2022-03-03 20:48:03,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default2]:[2022-03-03 20:48:03,915] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default6]:[2022-03-03 20:48:03,897] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default5]:[2022-03-03 20:48:03,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default2]:[2022-03-03 20:48:03,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default7]:[2022-03-03 20:48:03,960] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default6]:[2022-03-03 20:48:03,983] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default3]:[2022-03-03 20:48:04,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default3]:[2022-03-03 20:48:04,172] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default5]:[2022-03-03 20:48:04,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default0]:[2022-03-03 20:48:04,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default5]:[2022-03-03 20:48:04,087] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default4]:[2022-03-03 20:48:04,236] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default3]:[2022-03-03 20:48:04,314] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default2]:[2022-03-03 20:48:04,374] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default3]:[2022-03-03 20:48:04,453] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default2]:[2022-03-03 20:48:04,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default2]:[2022-03-03 20:48:04,547] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default3]:[2022-03-03 20:48:04,725] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default2]:[2022-03-03 20:48:04,838] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default6]:[2022-03-03 20:48:04,947] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default7]:[2022-03-03 20:48:04,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default3]:[2022-03-03 20:48:04,955] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default6]:[2022-03-03 20:48:04,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default6]:[2022-03-03 20:48:04,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default0]:[2022-03-03 20:48:04,947] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default1]:[2022-03-03 20:48:05,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default7]:[2022-03-03 20:48:05,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default6]:[2022-03-03 20:48:05,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default2]:[2022-03-03 20:48:05,153] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default7]:[2022-03-03 20:48:05,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default1]:[2022-03-03 20:48:05,642] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default7]:[2022-03-03 20:48:05,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default0]:[2022-03-03 20:48:05,702] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default0]:[2022-03-03 20:48:05,687] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default1]:[2022-03-03 20:48:05,887] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default1]:[2022-03-03 20:48:06,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default5]:[2022-03-03 20:48:06,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default4]:[2022-03-03 20:48:06,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default6]:[2022-03-03 20:48:06,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default7]:[2022-03-03 20:48:06,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default5]:[2022-03-03 20:48:06,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default4]:[2022-03-03 20:48:06,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default4]:[2022-03-03 20:48:06,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default0]:[2022-03-03 20:48:06,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default5]:[2022-03-03 20:48:06,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default3]:[2022-03-03 20:48:06,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default2]:[2022-03-03 20:48:06,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default1]:[2022-03-03 20:48:07,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default0]:[2022-03-03 20:48:07,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default7]:[2022-03-03 20:48:07,127] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default6]:[2022-03-03 20:48:07,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default5]:[2022-03-03 20:48:07,452] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default4]:[2022-03-03 20:48:07,456] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default5]:[2022-03-03 20:48:07,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default4]:[2022-03-03 20:48:07,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default5]:[2022-03-03 20:48:08,273] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default4]:[2022-03-03 20:48:08,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default4]:[2022-03-03 20:48:09,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default5]:[2022-03-03 20:48:09,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default7]:[2022-03-03 20:48:09,833] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default6]:[2022-03-03 20:48:09,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default4]:[2022-03-03 20:48:10,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default5]:[2022-03-03 20:48:10,996] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step3500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default7]:time (ms) | save-checkpoint: 42646.90 [default0]: successfully saved checkpoint at iteration 3500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default7]: iteration 3501/ 128728 | consumed samples: 56016 | consumed tokens: 114720768 | elapsed time per iteration (s): 57.87 | learning rate: 1.836E-05 | global batch size: 16 | lm loss: 5.723885E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.276 | TFLOPs: 2.12 | [default7]: iteration 3502/ 128728 | consumed samples: 56032 | consumed tokens: 114753536 | elapsed time per iteration (s): 15.19 | learning rate: 1.836E-05 | global batch size: 16 | lm loss: 5.632663E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3503/ 128728 | consumed samples: 56048 | consumed tokens: 114786304 | elapsed time per iteration (s): 15.18 | learning rate: 1.837E-05 | global batch size: 16 | lm loss: 5.605666E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3504/ 128728 | consumed samples: 56064 | consumed tokens: 114819072 | elapsed time per iteration (s): 15.21 | learning rate: 1.837E-05 | global batch size: 16 | lm loss: 5.401419E+00 | grad norm: 0.784 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3505/ 128728 | consumed samples: 56080 | consumed tokens: 114851840 | elapsed time per iteration (s): 15.18 | learning rate: 1.838E-05 | global batch size: 16 | lm loss: 5.347599E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3506/ 128728 | consumed samples: 56096 | consumed tokens: 114884608 | elapsed time per iteration (s): 15.22 | learning rate: 1.838E-05 | global batch size: 16 | lm loss: 5.724100E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3507/ 128728 | consumed samples: 56112 | consumed tokens: 114917376 | elapsed time per iteration (s): 15.20 | learning rate: 1.839E-05 | global batch size: 16 | lm loss: 5.753415E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3508/ 128728 | consumed samples: 56128 | consumed tokens: 114950144 | elapsed time per iteration (s): 15.20 | learning rate: 1.839E-05 | global batch size: 16 | lm loss: 5.617841E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3509/ 128728 | consumed samples: 56144 | consumed tokens: 114982912 | elapsed time per iteration (s): 15.22 | learning rate: 1.840E-05 | global batch 
size: 16 | lm loss: 5.481423E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3510/ 128728 | consumed samples: 56160 | consumed tokens: 115015680 | elapsed time per iteration (s): 15.21 | learning rate: 1.840E-05 | global batch size: 16 | lm loss: 5.751305E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3511/ 128728 | consumed samples: 56176 | consumed tokens: 115048448 | elapsed time per iteration (s): 15.19 | learning rate: 1.841E-05 | global batch size: 16 | lm loss: 5.546777E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3512/ 128728 | consumed samples: 56192 | consumed tokens: 115081216 | elapsed time per iteration (s): 15.21 | learning rate: 1.841E-05 | global batch size: 16 | lm loss: 5.549484E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3513/ 128728 | consumed samples: 56208 | consumed tokens: 115113984 | elapsed time per iteration (s): 15.19 | learning rate: 1.842E-05 | global batch size: 16 | lm loss: 5.823066E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3514/ 128728 | consumed samples: 56224 | consumed tokens: 115146752 | elapsed time per iteration (s): 15.21 | learning rate: 1.842E-05 | global batch size: 16 | lm loss: 5.627325E+00 | grad norm: 0.633 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3515/ 128728 | consumed samples: 56240 | consumed 
tokens: 115179520 | elapsed time per iteration (s): 15.20 | learning rate: 1.843E-05 | global batch size: 16 | lm loss: 5.750383E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3516/ 128728 | consumed samples: 56256 | consumed tokens: 115212288 | elapsed time per iteration (s): 15.20 | learning rate: 1.843E-05 | global batch size: 16 | lm loss: 5.526446E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3517/ 128728 | consumed samples: 56272 | consumed tokens: 115245056 | elapsed time per iteration (s): 15.23 | learning rate: 1.844E-05 | global batch size: 16 | lm loss: 5.721191E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3518/ 128728 | consumed samples: 56288 | consumed tokens: 115277824 | elapsed time per iteration (s): 15.19 | learning rate: 1.844E-05 | global batch size: 16 | lm loss: 5.594088E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3519/ 128728 | consumed samples: 56304 | consumed tokens: 115310592 | elapsed time per iteration (s): 15.24 | learning rate: 1.845E-05 | global batch size: 16 | lm loss: 5.617053E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3520/ 128728 | consumed samples: 56320 | consumed tokens: 115343360 | elapsed time per iteration (s): 15.24 | learning rate: 1.845E-05 | global batch size: 16 | lm loss: 5.854468E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3521/ 128728 | consumed samples: 56336 | consumed tokens: 115376128 | elapsed time per iteration (s): 15.23 | learning rate: 1.846E-05 | global batch size: 16 | lm loss: 5.725595E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3522/ 128728 | consumed samples: 56352 | consumed tokens: 115408896 | elapsed time per iteration (s): 15.22 | learning rate: 1.847E-05 | global batch size: 16 | lm loss: 5.628036E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3523/ 128728 | consumed samples: 56368 | consumed tokens: 115441664 | elapsed time per iteration (s): 15.24 | learning rate: 1.847E-05 | global batch size: 16 | lm loss: 5.498308E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3524/ 128728 | consumed samples: 56384 | consumed tokens: 115474432 | elapsed time per iteration (s): 15.19 | learning rate: 1.848E-05 | global batch size: 16 | lm loss: 5.595693E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3525/ 128728 | consumed samples: 56400 | consumed tokens: 115507200 | elapsed time per iteration (s): 15.21 | learning rate: 1.848E-05 | global batch size: 16 | lm loss: 5.580093E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3526/ 128728 | consumed samples: 56416 | consumed tokens: 115539968 | elapsed time per iteration (s): 15.23 | learning rate: 1.849E-05 | global batch size: 16 | lm loss: 5.568558E+00 | grad norm: 
0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3527/ 128728 | consumed samples: 56432 | consumed tokens: 115572736 | elapsed time per iteration (s): 15.24 | learning rate: 1.849E-05 | global batch size: 16 | lm loss: 5.609416E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3528/ 128728 | consumed samples: 56448 | consumed tokens: 115605504 | elapsed time per iteration (s): 15.18 | learning rate: 1.850E-05 | global batch size: 16 | lm loss: 5.554018E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3529/ 128728 | consumed samples: 56464 | consumed tokens: 115638272 | elapsed time per iteration (s): 15.16 | learning rate: 1.850E-05 | global batch size: 16 | lm loss: 5.508449E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3530/ 128728 | consumed samples: 56480 | consumed tokens: 115671040 | elapsed time per iteration (s): 15.23 | learning rate: 1.851E-05 | global batch size: 16 | lm loss: 5.647694E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3531/ 128728 | consumed samples: 56496 | consumed tokens: 115703808 | elapsed time per iteration (s): 15.18 | learning rate: 1.851E-05 | global batch size: 16 | lm loss: 5.667585E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3532/ 128728 | consumed samples: 56512 | consumed tokens: 115736576 | elapsed time per 
iteration (s): 15.20 | learning rate: 1.852E-05 | global batch size: 16 | lm loss: 5.511345E+00 | grad norm: 0.908 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3533/ 128728 | consumed samples: 56528 | consumed tokens: 115769344 | elapsed time per iteration (s): 15.18 | learning rate: 1.852E-05 | global batch size: 16 | lm loss: 5.457089E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3534/ 128728 | consumed samples: 56544 | consumed tokens: 115802112 | elapsed time per iteration (s): 15.18 | learning rate: 1.853E-05 | global batch size: 16 | lm loss: 5.487876E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3535/ 128728 | consumed samples: 56560 | consumed tokens: 115834880 | elapsed time per iteration (s): 15.24 | learning rate: 1.853E-05 | global batch size: 16 | lm loss: 5.762383E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3536/ 128728 | consumed samples: 56576 | consumed tokens: 115867648 | elapsed time per iteration (s): 15.20 | learning rate: 1.854E-05 | global batch size: 16 | lm loss: 5.579982E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3537/ 128728 | consumed samples: 56592 | consumed tokens: 115900416 | elapsed time per iteration (s): 15.24 | learning rate: 1.854E-05 | global batch size: 16 | lm loss: 5.651605E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 3538/ 128728 | consumed samples: 56608 | consumed tokens: 115933184 | elapsed time per iteration (s): 15.20 | learning rate: 1.855E-05 | global batch size: 16 | lm loss: 5.665345E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3539/ 128728 | consumed samples: 56624 | consumed tokens: 115965952 | elapsed time per iteration (s): 15.18 | learning rate: 1.855E-05 | global batch size: 16 | lm loss: 5.426301E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3540/ 128728 | consumed samples: 56640 | consumed tokens: 115998720 | elapsed time per iteration (s): 15.20 | learning rate: 1.856E-05 | global batch size: 16 | lm loss: 5.570403E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3541/ 128728 | consumed samples: 56656 | consumed tokens: 116031488 | elapsed time per iteration (s): 15.18 | learning rate: 1.857E-05 | global batch size: 16 | lm loss: 5.609330E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3542/ 128728 | consumed samples: 56672 | consumed tokens: 116064256 | elapsed time per iteration (s): 15.21 | learning rate: 1.857E-05 | global batch size: 16 | lm loss: 5.696312E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3543/ 128728 | consumed samples: 56688 | consumed tokens: 116097024 | elapsed time per iteration (s): 15.18 | learning rate: 1.858E-05 | global batch size: 16 | lm loss: 5.457860E+00 | grad norm: 0.680 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3544/ 128728 | consumed samples: 56704 | consumed tokens: 116129792 | elapsed time per iteration (s): 15.22 | learning rate: 1.858E-05 | global batch size: 16 | lm loss: 5.377970E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3545/ 128728 | consumed samples: 56720 | consumed tokens: 116162560 | elapsed time per iteration (s): 15.14 | learning rate: 1.859E-05 | global batch size: 16 | lm loss: 5.565271E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3546/ 128728 | consumed samples: 56736 | consumed tokens: 116195328 | elapsed time per iteration (s): 15.19 | learning rate: 1.859E-05 | global batch size: 16 | lm loss: 5.541815E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3547/ 128728 | consumed samples: 56752 | consumed tokens: 116228096 | elapsed time per iteration (s): 15.22 | learning rate: 1.860E-05 | global batch size: 16 | lm loss: 5.579144E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3548/ 128728 | consumed samples: 56768 | consumed tokens: 116260864 | elapsed time per iteration (s): 15.24 | learning rate: 1.860E-05 | global batch size: 16 | lm loss: 5.499104E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3549/ 128728 | consumed samples: 56784 | consumed tokens: 116293632 | elapsed time per iteration (s): 15.21 | learning rate: 
1.861E-05 | global batch size: 16 | lm loss: 5.444351E+00 | grad norm: 1.107 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3550/ 128728 | consumed samples: 56800 | consumed tokens: 116326400 | elapsed time per iteration (s): 15.25 | learning rate: 1.861E-05 | global batch size: 16 | lm loss: 5.384247E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 3551/ 128728 | consumed samples: 56816 | consumed tokens: 116359168 | elapsed time per iteration (s): 15.25 | learning rate: 1.862E-05 | global batch size: 16 | lm loss: 5.644943E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3552/ 128728 | consumed samples: 56832 | consumed tokens: 116391936 | elapsed time per iteration (s): 15.14 | learning rate: 1.862E-05 | global batch size: 16 | lm loss: 5.620580E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3553/ 128728 | consumed samples: 56848 | consumed tokens: 116424704 | elapsed time per iteration (s): 15.19 | learning rate: 1.863E-05 | global batch size: 16 | lm loss: 5.781569E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3554/ 128728 | consumed samples: 56864 | consumed tokens: 116457472 | elapsed time per iteration (s): 15.17 | learning rate: 1.863E-05 | global batch size: 16 | lm loss: 5.655607E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3555/ 128728 | consumed 
samples: 56880 | consumed tokens: 116490240 | elapsed time per iteration (s): 15.15 | learning rate: 1.864E-05 | global batch size: 16 | lm loss: 5.440409E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3556/ 128728 | consumed samples: 56896 | consumed tokens: 116523008 | elapsed time per iteration (s): 15.21 | learning rate: 1.864E-05 | global batch size: 16 | lm loss: 5.547821E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3557/ 128728 | consumed samples: 56912 | consumed tokens: 116555776 | elapsed time per iteration (s): 15.21 | learning rate: 1.865E-05 | global batch size: 16 | lm loss: 5.477743E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3558/ 128728 | consumed samples: 56928 | consumed tokens: 116588544 | elapsed time per iteration (s): 15.18 | learning rate: 1.865E-05 | global batch size: 16 | lm loss: 5.752525E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3559/ 128728 | consumed samples: 56944 | consumed tokens: 116621312 | elapsed time per iteration (s): 15.21 | learning rate: 1.866E-05 | global batch size: 16 | lm loss: 5.561419E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3560/ 128728 | consumed samples: 56960 | consumed tokens: 116654080 | elapsed time per iteration (s): 15.21 | learning rate: 1.866E-05 | global batch size: 16 | lm loss: 5.594239E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3561/ 128728 | consumed samples: 56976 | consumed tokens: 116686848 | elapsed time per iteration (s): 15.22 | learning rate: 1.867E-05 | global batch size: 16 | lm loss: 5.820959E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3562/ 128728 | consumed samples: 56992 | consumed tokens: 116719616 | elapsed time per iteration (s): 15.22 | learning rate: 1.868E-05 | global batch size: 16 | lm loss: 5.637988E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3563/ 128728 | consumed samples: 57008 | consumed tokens: 116752384 | elapsed time per iteration (s): 15.21 | learning rate: 1.868E-05 | global batch size: 16 | lm loss: 5.698305E+00 | grad norm: 0.813 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3564/ 128728 | consumed samples: 57024 | consumed tokens: 116785152 | elapsed time per iteration (s): 15.19 | learning rate: 1.869E-05 | global batch size: 16 | lm loss: 5.562874E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3565/ 128728 | consumed samples: 57040 | consumed tokens: 116817920 | elapsed time per iteration (s): 15.22 | learning rate: 1.869E-05 | global batch size: 16 | lm loss: 5.334938E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3566/ 128728 | consumed samples: 57056 | consumed tokens: 116850688 | elapsed time per iteration (s): 15.19 | learning rate: 1.870E-05 | global batch size: 16 | lm 
loss: 5.632464E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3567/ 128728 | consumed samples: 57072 | consumed tokens: 116883456 | elapsed time per iteration (s): 15.17 | learning rate: 1.870E-05 | global batch size: 16 | lm loss: 5.528159E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3568/ 128728 | consumed samples: 57088 | consumed tokens: 116916224 | elapsed time per iteration (s): 15.23 | learning rate: 1.871E-05 | global batch size: 16 | lm loss: 5.648876E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3569/ 128728 | consumed samples: 57104 | consumed tokens: 116948992 | elapsed time per iteration (s): 15.22 | learning rate: 1.871E-05 | global batch size: 16 | lm loss: 5.652656E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3570/ 128728 | consumed samples: 57120 | consumed tokens: 116981760 | elapsed time per iteration (s): 15.17 | learning rate: 1.872E-05 | global batch size: 16 | lm loss: 5.550305E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3571/ 128728 | consumed samples: 57136 | consumed tokens: 117014528 | elapsed time per iteration (s): 15.19 | learning rate: 1.872E-05 | global batch size: 16 | lm loss: 5.244218E+00 | grad norm: 0.892 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3572/ 128728 | consumed samples: 57152 | consumed tokens: 
117047296 | elapsed time per iteration (s): 15.23 | learning rate: 1.873E-05 | global batch size: 16 | lm loss: 5.495933E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3573/ 128728 | consumed samples: 57168 | consumed tokens: 117080064 | elapsed time per iteration (s): 15.20 | learning rate: 1.873E-05 | global batch size: 16 | lm loss: 5.597926E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3574/ 128728 | consumed samples: 57184 | consumed tokens: 117112832 | elapsed time per iteration (s): 15.24 | learning rate: 1.874E-05 | global batch size: 16 | lm loss: 5.457273E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3575/ 128728 | consumed samples: 57200 | consumed tokens: 117145600 | elapsed time per iteration (s): 15.14 | learning rate: 1.874E-05 | global batch size: 16 | lm loss: 5.458507E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3576/ 128728 | consumed samples: 57216 | consumed tokens: 117178368 | elapsed time per iteration (s): 15.24 | learning rate: 1.875E-05 | global batch size: 16 | lm loss: 5.373246E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3577/ 128728 | consumed samples: 57232 | consumed tokens: 117211136 | elapsed time per iteration (s): 15.15 | learning rate: 1.875E-05 | global batch size: 16 | lm loss: 5.753658E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.056 | TFLOPs: 8.09 | [default7]: iteration 3578/ 128728 | consumed samples: 57248 | consumed tokens: 117243904 | elapsed time per iteration (s): 15.24 | learning rate: 1.876E-05 | global batch size: 16 | lm loss: 5.383017E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3579/ 128728 | consumed samples: 57264 | consumed tokens: 117276672 | elapsed time per iteration (s): 15.23 | learning rate: 1.876E-05 | global batch size: 16 | lm loss: 5.562088E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3580/ 128728 | consumed samples: 57280 | consumed tokens: 117309440 | elapsed time per iteration (s): 15.22 | learning rate: 1.877E-05 | global batch size: 16 | lm loss: 5.501311E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3581/ 128728 | consumed samples: 57296 | consumed tokens: 117342208 | elapsed time per iteration (s): 15.23 | learning rate: 1.877E-05 | global batch size: 16 | lm loss: 5.498442E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3582/ 128728 | consumed samples: 57312 | consumed tokens: 117374976 | elapsed time per iteration (s): 15.20 | learning rate: 1.878E-05 | global batch size: 16 | lm loss: 5.639647E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3583/ 128728 | consumed samples: 57328 | consumed tokens: 117407744 | elapsed time per iteration (s): 15.21 | learning rate: 1.879E-05 | global batch size: 16 | lm loss: 5.382240E+00 | grad norm: 0.756 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3584/ 128728 | consumed samples: 57344 | consumed tokens: 117440512 | elapsed time per iteration (s): 15.22 | learning rate: 1.879E-05 | global batch size: 16 | lm loss: 5.583954E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3585/ 128728 | consumed samples: 57360 | consumed tokens: 117473280 | elapsed time per iteration (s): 15.23 | learning rate: 1.880E-05 | global batch size: 16 | lm loss: 5.507063E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3586/ 128728 | consumed samples: 57376 | consumed tokens: 117506048 | elapsed time per iteration (s): 15.20 | learning rate: 1.880E-05 | global batch size: 16 | lm loss: 5.361601E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3587/ 128728 | consumed samples: 57392 | consumed tokens: 117538816 | elapsed time per iteration (s): 15.17 | learning rate: 1.881E-05 | global batch size: 16 | lm loss: 5.580978E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3588/ 128728 | consumed samples: 57408 | consumed tokens: 117571584 | elapsed time per iteration (s): 15.22 | learning rate: 1.881E-05 | global batch size: 16 | lm loss: 5.566353E+00 | grad norm: 1.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3589/ 128728 | consumed samples: 57424 | consumed tokens: 117604352 | elapsed time per iteration (s): 
15.21 | learning rate: 1.882E-05 | global batch size: 16 | lm loss: 5.816396E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3590/ 128728 | consumed samples: 57440 | consumed tokens: 117637120 | elapsed time per iteration (s): 15.17 | learning rate: 1.882E-05 | global batch size: 16 | lm loss: 5.590647E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3591/ 128728 | consumed samples: 57456 | consumed tokens: 117669888 | elapsed time per iteration (s): 15.21 | learning rate: 1.883E-05 | global batch size: 16 | lm loss: 5.503424E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3592/ 128728 | consumed samples: 57472 | consumed tokens: 117702656 | elapsed time per iteration (s): 15.20 | learning rate: 1.883E-05 | global batch size: 16 | lm loss: 5.546864E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3593/ 128728 | consumed samples: 57488 | consumed tokens: 117735424 | elapsed time per iteration (s): 15.17 | learning rate: 1.884E-05 | global batch size: 16 | lm loss: 5.547342E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3594/ 128728 | consumed samples: 57504 | consumed tokens: 117768192 | elapsed time per iteration (s): 15.21 | learning rate: 1.884E-05 | global batch size: 16 | lm loss: 5.557863E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 
3595/ 128728 | consumed samples: 57520 | consumed tokens: 117800960 | elapsed time per iteration (s): 15.20 | learning rate: 1.885E-05 | global batch size: 16 | lm loss: 5.362972E+00 | grad norm: 1.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3596/ 128728 | consumed samples: 57536 | consumed tokens: 117833728 | elapsed time per iteration (s): 15.22 | learning rate: 1.885E-05 | global batch size: 16 | lm loss: 5.553192E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3597/ 128728 | consumed samples: 57552 | consumed tokens: 117866496 | elapsed time per iteration (s): 15.23 | learning rate: 1.886E-05 | global batch size: 16 | lm loss: 5.183071E+00 | grad norm: 0.941 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3598/ 128728 | consumed samples: 57568 | consumed tokens: 117899264 | elapsed time per iteration (s): 15.22 | learning rate: 1.886E-05 | global batch size: 16 | lm loss: 5.619958E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3599/ 128728 | consumed samples: 57584 | consumed tokens: 117932032 | elapsed time per iteration (s): 15.24 | learning rate: 1.887E-05 | global batch size: 16 | lm loss: 5.533691E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3600/ 128728 | consumed samples: 57600 | consumed tokens: 117964800 | elapsed time per iteration (s): 15.23 | learning rate: 1.887E-05 | global batch size: 16 | lm loss: 5.799161E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3601/ 128728 | consumed samples: 57616 | consumed tokens: 117997568 | elapsed time per iteration (s): 15.19 | learning rate: 1.888E-05 | global batch size: 16 | lm loss: 5.572991E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3602/ 128728 | consumed samples: 57632 | consumed tokens: 118030336 | elapsed time per iteration (s): 15.22 | learning rate: 1.888E-05 | global batch size: 16 | lm loss: 5.398020E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3603/ 128728 | consumed samples: 57648 | consumed tokens: 118063104 | elapsed time per iteration (s): 15.19 | learning rate: 1.889E-05 | global batch size: 16 | lm loss: 5.634165E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3604/ 128728 | consumed samples: 57664 | consumed tokens: 118095872 | elapsed time per iteration (s): 15.21 | learning rate: 1.890E-05 | global batch size: 16 | lm loss: 5.592557E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3605/ 128728 | consumed samples: 57680 | consumed tokens: 118128640 | elapsed time per iteration (s): 15.15 | learning rate: 1.890E-05 | global batch size: 16 | lm loss: 5.571424E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3606/ 128728 | consumed samples: 57696 | consumed tokens: 118161408 | elapsed time per iteration (s): 15.17 | learning rate: 1.891E-05 | 
global batch size: 16 | lm loss: 5.605553E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3607/ 128728 | consumed samples: 57712 | consumed tokens: 118194176 | elapsed time per iteration (s): 15.20 | learning rate: 1.891E-05 | global batch size: 16 | lm loss: 5.751481E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3608/ 128728 | consumed samples: 57728 | consumed tokens: 118226944 | elapsed time per iteration (s): 15.20 | learning rate: 1.892E-05 | global batch size: 16 | lm loss: 5.682261E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3609/ 128728 | consumed samples: 57744 | consumed tokens: 118259712 | elapsed time per iteration (s): 15.17 | learning rate: 1.892E-05 | global batch size: 16 | lm loss: 5.607131E+00 | grad norm: 1.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3610/ 128728 | consumed samples: 57760 | consumed tokens: 118292480 | elapsed time per iteration (s): 15.20 | learning rate: 1.893E-05 | global batch size: 16 | lm loss: 5.541697E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3611/ 128728 | consumed samples: 57776 | consumed tokens: 118325248 | elapsed time per iteration (s): 15.24 | learning rate: 1.893E-05 | global batch size: 16 | lm loss: 5.603507E+00 | grad norm: 1.274 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3612/ 128728 | consumed samples: 
57792 | consumed tokens: 118358016 | elapsed time per iteration (s): 15.22 | learning rate: 1.894E-05 | global batch size: 16 | lm loss: 5.721704E+00 | grad norm: 0.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3613/ 128728 | consumed samples: 57808 | consumed tokens: 118390784 | elapsed time per iteration (s): 15.23 | learning rate: 1.894E-05 | global batch size: 16 | lm loss: 5.661789E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3614/ 128728 | consumed samples: 57824 | consumed tokens: 118423552 | elapsed time per iteration (s): 15.23 | learning rate: 1.895E-05 | global batch size: 16 | lm loss: 5.765802E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3615/ 128728 | consumed samples: 57840 | consumed tokens: 118456320 | elapsed time per iteration (s): 15.22 | learning rate: 1.895E-05 | global batch size: 16 | lm loss: 5.475472E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3616/ 128728 | consumed samples: 57856 | consumed tokens: 118489088 | elapsed time per iteration (s): 15.22 | learning rate: 1.896E-05 | global batch size: 16 | lm loss: 5.469672E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3617/ 128728 | consumed samples: 57872 | consumed tokens: 118521856 | elapsed time per iteration (s): 15.22 | learning rate: 1.896E-05 | global batch size: 16 | lm loss: 5.555403E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3618/ 128728 | consumed samples: 57888 | consumed tokens: 118554624 | elapsed time per iteration (s): 15.21 | learning rate: 1.897E-05 | global batch size: 16 | lm loss: 5.757840E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3619/ 128728 | consumed samples: 57904 | consumed tokens: 118587392 | elapsed time per iteration (s): 15.20 | learning rate: 1.897E-05 | global batch size: 16 | lm loss: 5.454224E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3620/ 128728 | consumed samples: 57920 | consumed tokens: 118620160 | elapsed time per iteration (s): 15.24 | learning rate: 1.898E-05 | global batch size: 16 | lm loss: 5.460718E+00 | grad norm: 1.001 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3621/ 128728 | consumed samples: 57936 | consumed tokens: 118652928 | elapsed time per iteration (s): 15.23 | learning rate: 1.898E-05 | global batch size: 16 | lm loss: 5.752840E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3622/ 128728 | consumed samples: 57952 | consumed tokens: 118685696 | elapsed time per iteration (s): 15.22 | learning rate: 1.899E-05 | global batch size: 16 | lm loss: 5.772221E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3623/ 128728 | consumed samples: 57968 | consumed tokens: 118718464 | elapsed time per iteration (s): 15.20 | learning rate: 1.900E-05 | global batch size: 16 | lm loss: 5.500217E+00 
| grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3624/ 128728 | consumed samples: 57984 | consumed tokens: 118751232 | elapsed time per iteration (s): 15.20 | learning rate: 1.900E-05 | global batch size: 16 | lm loss: 5.437232E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3625/ 128728 | consumed samples: 58000 | consumed tokens: 118784000 | elapsed time per iteration (s): 15.23 | learning rate: 1.901E-05 | global batch size: 16 | lm loss: 5.481465E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3626/ 128728 | consumed samples: 58016 | consumed tokens: 118816768 | elapsed time per iteration (s): 15.24 | learning rate: 1.901E-05 | global batch size: 16 | lm loss: 5.507442E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3627/ 128728 | consumed samples: 58032 | consumed tokens: 118849536 | elapsed time per iteration (s): 15.24 | learning rate: 1.902E-05 | global batch size: 16 | lm loss: 5.689624E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3628/ 128728 | consumed samples: 58048 | consumed tokens: 118882304 | elapsed time per iteration (s): 15.21 | learning rate: 1.902E-05 | global batch size: 16 | lm loss: 5.502779E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3629/ 128728 | consumed samples: 58064 | consumed tokens: 118915072 | elapsed time 
per iteration (s): 15.20 | learning rate: 1.903E-05 | global batch size: 16 | lm loss: 5.628727E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3630/ 128728 | consumed samples: 58080 | consumed tokens: 118947840 | elapsed time per iteration (s): 15.20 | learning rate: 1.903E-05 | global batch size: 16 | lm loss: 5.490268E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3631/ 128728 | consumed samples: 58096 | consumed tokens: 118980608 | elapsed time per iteration (s): 15.25 | learning rate: 1.904E-05 | global batch size: 16 | lm loss: 5.512156E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3632/ 128728 | consumed samples: 58112 | consumed tokens: 119013376 | elapsed time per iteration (s): 15.21 | learning rate: 1.904E-05 | global batch size: 16 | lm loss: 5.381227E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3633/ 128728 | consumed samples: 58128 | consumed tokens: 119046144 | elapsed time per iteration (s): 15.19 | learning rate: 1.905E-05 | global batch size: 16 | lm loss: 5.426307E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3634/ 128728 | consumed samples: 58144 | consumed tokens: 119078912 | elapsed time per iteration (s): 15.27 | learning rate: 1.905E-05 | global batch size: 16 | lm loss: 5.770047E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | 
[default7]: iteration 3635/ 128728 | consumed samples: 58160 | consumed tokens: 119111680 | elapsed time per iteration (s): 15.20 | learning rate: 1.906E-05 | global batch size: 16 | lm loss: 5.419781E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3636/ 128728 | consumed samples: 58176 | consumed tokens: 119144448 | elapsed time per iteration (s): 15.22 | learning rate: 1.906E-05 | global batch size: 16 | lm loss: 5.744234E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3637/ 128728 | consumed samples: 58192 | consumed tokens: 119177216 | elapsed time per iteration (s): 15.25 | learning rate: 1.907E-05 | global batch size: 16 | lm loss: 5.680465E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3638/ 128728 | consumed samples: 58208 | consumed tokens: 119209984 | elapsed time per iteration (s): 15.21 | learning rate: 1.907E-05 | global batch size: 16 | lm loss: 5.462650E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3639/ 128728 | consumed samples: 58224 | consumed tokens: 119242752 | elapsed time per iteration (s): 15.23 | learning rate: 1.908E-05 | global batch size: 16 | lm loss: 5.425622E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3640/ 128728 | consumed samples: 58240 | consumed tokens: 119275520 | elapsed time per iteration (s): 15.20 | learning rate: 1.908E-05 | global batch size: 16 | lm loss: 5.565685E+00 | grad norm: 0.748 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3641/ 128728 | consumed samples: 58256 | consumed tokens: 119308288 | elapsed time per iteration (s): 15.24 | learning rate: 1.909E-05 | global batch size: 16 | lm loss: 5.555475E+00 | grad norm: 1.392 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3642/ 128728 | consumed samples: 58272 | consumed tokens: 119341056 | elapsed time per iteration (s): 15.23 | learning rate: 1.909E-05 | global batch size: 16 | lm loss: 5.856975E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3643/ 128728 | consumed samples: 58288 | consumed tokens: 119373824 | elapsed time per iteration (s): 15.16 | learning rate: 1.910E-05 | global batch size: 16 | lm loss: 5.520800E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3644/ 128728 | consumed samples: 58304 | consumed tokens: 119406592 | elapsed time per iteration (s): 15.17 | learning rate: 1.911E-05 | global batch size: 16 | lm loss: 5.231161E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3645/ 128728 | consumed samples: 58320 | consumed tokens: 119439360 | elapsed time per iteration (s): 15.20 | learning rate: 1.911E-05 | global batch size: 16 | lm loss: 5.715312E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3646/ 128728 | consumed samples: 58336 | consumed tokens: 119472128 | elapsed time per iteration (s): 15.21 | learning rate: 
1.912E-05 | global batch size: 16 | lm loss: 5.264447E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3647/ 128728 | consumed samples: 58352 | consumed tokens: 119504896 | elapsed time per iteration (s): 15.23 | learning rate: 1.912E-05 | global batch size: 16 | lm loss: 5.607737E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3648/ 128728 | consumed samples: 58368 | consumed tokens: 119537664 | elapsed time per iteration (s): 15.16 | learning rate: 1.913E-05 | global batch size: 16 | lm loss: 5.569743E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3649/ 128728 | consumed samples: 58384 | consumed tokens: 119570432 | elapsed time per iteration (s): 15.18 | learning rate: 1.913E-05 | global batch size: 16 | lm loss: 5.634804E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3650/ 128728 | consumed samples: 58400 | consumed tokens: 119603200 | elapsed time per iteration (s): 15.20 | learning rate: 1.914E-05 | global batch size: 16 | lm loss: 5.602137E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3651/ 128728 | consumed samples: 58416 | consumed tokens: 119635968 | elapsed time per iteration (s): 15.22 | learning rate: 1.914E-05 | global batch size: 16 | lm loss: 5.597826E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3652/ 128728 | consumed 
samples: 58432 | consumed tokens: 119668736 | elapsed time per iteration (s): 15.21 | learning rate: 1.915E-05 | global batch size: 16 | lm loss: 5.697678E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3653/ 128728 | consumed samples: 58448 | consumed tokens: 119701504 | elapsed time per iteration (s): 15.24 | learning rate: 1.915E-05 | global batch size: 16 | lm loss: 6.026344E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3654/ 128728 | consumed samples: 58464 | consumed tokens: 119734272 | elapsed time per iteration (s): 15.15 | learning rate: 1.916E-05 | global batch size: 16 | lm loss: 5.696335E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3655/ 128728 | consumed samples: 58480 | consumed tokens: 119767040 | elapsed time per iteration (s): 15.23 | learning rate: 1.916E-05 | global batch size: 16 | lm loss: 5.686172E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3656/ 128728 | consumed samples: 58496 | consumed tokens: 119799808 | elapsed time per iteration (s): 15.23 | learning rate: 1.917E-05 | global batch size: 16 | lm loss: 5.681462E+00 | grad norm: 1.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3657/ 128728 | consumed samples: 58512 | consumed tokens: 119832576 | elapsed time per iteration (s): 15.22 | learning rate: 1.917E-05 | global batch size: 16 | lm loss: 5.347002E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3658/ 128728 | consumed samples: 58528 | consumed tokens: 119865344 | elapsed time per iteration (s): 15.20 | learning rate: 1.918E-05 | global batch size: 16 | lm loss: 5.446877E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3659/ 128728 | consumed samples: 58544 | consumed tokens: 119898112 | elapsed time per iteration (s): 15.22 | learning rate: 1.918E-05 | global batch size: 16 | lm loss: 5.406514E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3660/ 128728 | consumed samples: 58560 | consumed tokens: 119930880 | elapsed time per iteration (s): 15.20 | learning rate: 1.919E-05 | global batch size: 16 | lm loss: 5.452915E+00 | grad norm: 1.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3661/ 128728 | consumed samples: 58576 | consumed tokens: 119963648 | elapsed time per iteration (s): 15.20 | learning rate: 1.919E-05 | global batch size: 16 | lm loss: 5.661624E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3662/ 128728 | consumed samples: 58592 | consumed tokens: 119996416 | elapsed time per iteration (s): 15.23 | learning rate: 1.920E-05 | global batch size: 16 | lm loss: 5.449157E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3663/ 128728 | consumed samples: 58608 | consumed tokens: 120029184 | elapsed time per iteration (s): 15.23 | learning rate: 1.920E-05 | global batch size: 16 | lm 
loss: 5.559745E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3664/ 128728 | consumed samples: 58624 | consumed tokens: 120061952 | elapsed time per iteration (s): 15.21 | learning rate: 1.921E-05 | global batch size: 16 | lm loss: 5.657228E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3665/ 128728 | consumed samples: 58640 | consumed tokens: 120094720 | elapsed time per iteration (s): 15.20 | learning rate: 1.922E-05 | global batch size: 16 | lm loss: 5.547557E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3666/ 128728 | consumed samples: 58656 | consumed tokens: 120127488 | elapsed time per iteration (s): 15.19 | learning rate: 1.922E-05 | global batch size: 16 | lm loss: 5.483784E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3667/ 128728 | consumed samples: 58672 | consumed tokens: 120160256 | elapsed time per iteration (s): 15.23 | learning rate: 1.923E-05 | global batch size: 16 | lm loss: 5.720974E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3668/ 128728 | consumed samples: 58688 | consumed tokens: 120193024 | elapsed time per iteration (s): 15.13 | learning rate: 1.923E-05 | global batch size: 16 | lm loss: 5.629973E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3669/ 128728 | consumed samples: 58704 | consumed tokens: 
120225792 | elapsed time per iteration (s): 15.25 | learning rate: 1.924E-05 | global batch size: 16 | lm loss: 5.490295E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3670/ 128728 | consumed samples: 58720 | consumed tokens: 120258560 | elapsed time per iteration (s): 15.20 | learning rate: 1.924E-05 | global batch size: 16 | lm loss: 5.663823E+00 | grad norm: 0.654 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3671/ 128728 | consumed samples: 58736 | consumed tokens: 120291328 | elapsed time per iteration (s): 15.23 | learning rate: 1.925E-05 | global batch size: 16 | lm loss: 5.565134E+00 | grad norm: 1.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3672/ 128728 | consumed samples: 58752 | consumed tokens: 120324096 | elapsed time per iteration (s): 15.27 | learning rate: 1.925E-05 | global batch size: 16 | lm loss: 5.505857E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3673/ 128728 | consumed samples: 58768 | consumed tokens: 120356864 | elapsed time per iteration (s): 15.25 | learning rate: 1.926E-05 | global batch size: 16 | lm loss: 5.505276E+00 | grad norm: 1.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3674/ 128728 | consumed samples: 58784 | consumed tokens: 120389632 | elapsed time per iteration (s): 15.21 | learning rate: 1.926E-05 | global batch size: 16 | lm loss: 5.554258E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.052 | TFLOPs: 8.06 | [default7]: iteration 3675/ 128728 | consumed samples: 58800 | consumed tokens: 120422400 | elapsed time per iteration (s): 15.26 | learning rate: 1.927E-05 | global batch size: 16 | lm loss: 5.709059E+00 | grad norm: 1.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3676/ 128728 | consumed samples: 58816 | consumed tokens: 120455168 | elapsed time per iteration (s): 15.23 | learning rate: 1.927E-05 | global batch size: 16 | lm loss: 5.620901E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3677/ 128728 | consumed samples: 58832 | consumed tokens: 120487936 | elapsed time per iteration (s): 15.18 | learning rate: 1.928E-05 | global batch size: 16 | lm loss: 5.432440E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3678/ 128728 | consumed samples: 58848 | consumed tokens: 120520704 | elapsed time per iteration (s): 15.22 | learning rate: 1.928E-05 | global batch size: 16 | lm loss: 5.577560E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3679/ 128728 | consumed samples: 58864 | consumed tokens: 120553472 | elapsed time per iteration (s): 15.23 | learning rate: 1.929E-05 | global batch size: 16 | lm loss: 5.770396E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3680/ 128728 | consumed samples: 58880 | consumed tokens: 120586240 | elapsed time per iteration (s): 15.18 | learning rate: 1.929E-05 | global batch size: 16 | lm loss: 5.468989E+00 | grad norm: 1.225 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3681/ 128728 | consumed samples: 58896 | consumed tokens: 120619008 | elapsed time per iteration (s): 15.23 | learning rate: 1.930E-05 | global batch size: 16 | lm loss: 5.550355E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3682/ 128728 | consumed samples: 58912 | consumed tokens: 120651776 | elapsed time per iteration (s): 15.23 | learning rate: 1.930E-05 | global batch size: 16 | lm loss: 5.747350E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3683/ 128728 | consumed samples: 58928 | consumed tokens: 120684544 | elapsed time per iteration (s): 15.23 | learning rate: 1.931E-05 | global batch size: 16 | lm loss: 5.518972E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3684/ 128728 | consumed samples: 58944 | consumed tokens: 120717312 | elapsed time per iteration (s): 15.16 | learning rate: 1.931E-05 | global batch size: 16 | lm loss: 5.709407E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3685/ 128728 | consumed samples: 58960 | consumed tokens: 120750080 | elapsed time per iteration (s): 15.18 | learning rate: 1.932E-05 | global batch size: 16 | lm loss: 5.524895E+00 | grad norm: 1.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3686/ 128728 | consumed samples: 58976 | consumed tokens: 120782848 | elapsed time per iteration (s): 
15.22 | learning rate: 1.933E-05 | global batch size: 16 | lm loss: 5.538244E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3687/ 128728 | consumed samples: 58992 | consumed tokens: 120815616 | elapsed time per iteration (s): 15.22 | learning rate: 1.933E-05 | global batch size: 16 | lm loss: 5.499589E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3688/ 128728 | consumed samples: 59008 | consumed tokens: 120848384 | elapsed time per iteration (s): 15.27 | learning rate: 1.934E-05 | global batch size: 16 | lm loss: 5.613136E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3689/ 128728 | consumed samples: 59024 | consumed tokens: 120881152 | elapsed time per iteration (s): 15.22 | learning rate: 1.934E-05 | global batch size: 16 | lm loss: 5.585117E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3690/ 128728 | consumed samples: 59040 | consumed tokens: 120913920 | elapsed time per iteration (s): 15.26 | learning rate: 1.935E-05 | global batch size: 16 | lm loss: 5.569749E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3691/ 128728 | consumed samples: 59056 | consumed tokens: 120946688 | elapsed time per iteration (s): 15.23 | learning rate: 1.935E-05 | global batch size: 16 | lm loss: 5.599214E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 
3692/ 128728 | consumed samples: 59072 | consumed tokens: 120979456 | elapsed time per iteration (s): 15.25 | learning rate: 1.936E-05 | global batch size: 16 | lm loss: 5.365727E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3693/ 128728 | consumed samples: 59088 | consumed tokens: 121012224 | elapsed time per iteration (s): 15.19 | learning rate: 1.936E-05 | global batch size: 16 | lm loss: 5.710306E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3694/ 128728 | consumed samples: 59104 | consumed tokens: 121044992 | elapsed time per iteration (s): 15.23 | learning rate: 1.937E-05 | global batch size: 16 | lm loss: 5.315215E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3695/ 128728 | consumed samples: 59120 | consumed tokens: 121077760 | elapsed time per iteration (s): 15.23 | learning rate: 1.937E-05 | global batch size: 16 | lm loss: 5.607258E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3696/ 128728 | consumed samples: 59136 | consumed tokens: 121110528 | elapsed time per iteration (s): 15.19 | learning rate: 1.938E-05 | global batch size: 16 | lm loss: 5.551528E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3697/ 128728 | consumed samples: 59152 | consumed tokens: 121143296 | elapsed time per iteration (s): 15.24 | learning rate: 1.938E-05 | global batch size: 16 | lm loss: 5.566436E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3698/ 128728 | consumed samples: 59168 | consumed tokens: 121176064 | elapsed time per iteration (s): 15.20 | learning rate: 1.939E-05 | global batch size: 16 | lm loss: 5.264910E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3699/ 128728 | consumed samples: 59184 | consumed tokens: 121208832 | elapsed time per iteration (s): 15.19 | learning rate: 1.939E-05 | global batch size: 16 | lm loss: 5.505784E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3700/ 128728 | consumed samples: 59200 | consumed tokens: 121241600 | elapsed time per iteration (s): 15.21 | learning rate: 1.940E-05 | global batch size: 16 | lm loss: 5.399070E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3701/ 128728 | consumed samples: 59216 | consumed tokens: 121274368 | elapsed time per iteration (s): 15.25 | learning rate: 1.940E-05 | global batch size: 16 | lm loss: 5.653021E+00 | grad norm: 1.537 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3702/ 128728 | consumed samples: 59232 | consumed tokens: 121307136 | elapsed time per iteration (s): 15.25 | learning rate: 1.941E-05 | global batch size: 16 | lm loss: 5.541692E+00 | grad norm: 1.023 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3703/ 128728 | consumed samples: 59248 | consumed tokens: 121339904 | elapsed time per iteration (s): 15.16 | learning rate: 1.941E-05 | 
global batch size: 16 | lm loss: 5.357801E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3704/ 128728 | consumed samples: 59264 | consumed tokens: 121372672 | elapsed time per iteration (s): 15.23 | learning rate: 1.942E-05 | global batch size: 16 | lm loss: 5.619140E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3705/ 128728 | consumed samples: 59280 | consumed tokens: 121405440 | elapsed time per iteration (s): 15.18 | learning rate: 1.942E-05 | global batch size: 16 | lm loss: 5.628579E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3706/ 128728 | consumed samples: 59296 | consumed tokens: 121438208 | elapsed time per iteration (s): 15.22 | learning rate: 1.943E-05 | global batch size: 16 | lm loss: 5.582848E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3707/ 128728 | consumed samples: 59312 | consumed tokens: 121470976 | elapsed time per iteration (s): 15.23 | learning rate: 1.944E-05 | global batch size: 16 | lm loss: 5.396257E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3708/ 128728 | consumed samples: 59328 | consumed tokens: 121503744 | elapsed time per iteration (s): 15.23 | learning rate: 1.944E-05 | global batch size: 16 | lm loss: 5.520443E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3709/ 128728 | consumed samples: 
59344 | consumed tokens: 121536512 | elapsed time per iteration (s): 15.25 | learning rate: 1.945E-05 | global batch size: 16 | lm loss: 5.484709E+00 | grad norm: 0.971 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3710/ 128728 | consumed samples: 59360 | consumed tokens: 121569280 | elapsed time per iteration (s): 15.17 | learning rate: 1.945E-05 | global batch size: 16 | lm loss: 5.345929E+00 | grad norm: 1.045 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3711/ 128728 | consumed samples: 59376 | consumed tokens: 121602048 | elapsed time per iteration (s): 15.24 | learning rate: 1.946E-05 | global batch size: 16 | lm loss: 5.598679E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3712/ 128728 | consumed samples: 59392 | consumed tokens: 121634816 | elapsed time per iteration (s): 15.22 | learning rate: 1.946E-05 | global batch size: 16 | lm loss: 5.456141E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3713/ 128728 | consumed samples: 59408 | consumed tokens: 121667584 | elapsed time per iteration (s): 15.30 | learning rate: 1.947E-05 | global batch size: 16 | lm loss: 5.482525E+00 | grad norm: 1.067 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 3714/ 128728 | consumed samples: 59424 | consumed tokens: 121700352 | elapsed time per iteration (s): 15.25 | learning rate: 1.947E-05 | global batch size: 16 | lm loss: 5.399364E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3715/ 128728 | consumed samples: 59440 | consumed tokens: 121733120 | elapsed time per iteration (s): 15.28 | learning rate: 1.948E-05 | global batch size: 16 | lm loss: 5.519878E+00 | grad norm: 1.093 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 3716/ 128728 | consumed samples: 59456 | consumed tokens: 121765888 | elapsed time per iteration (s): 15.20 | learning rate: 1.948E-05 | global batch size: 16 | lm loss: 5.493938E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3717/ 128728 | consumed samples: 59472 | consumed tokens: 121798656 | elapsed time per iteration (s): 15.22 | learning rate: 1.949E-05 | global batch size: 16 | lm loss: 5.431820E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3718/ 128728 | consumed samples: 59488 | consumed tokens: 121831424 | elapsed time per iteration (s): 15.22 | learning rate: 1.949E-05 | global batch size: 16 | lm loss: 5.457542E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3719/ 128728 | consumed samples: 59504 | consumed tokens: 121864192 | elapsed time per iteration (s): 15.21 | learning rate: 1.950E-05 | global batch size: 16 | lm loss: 5.463506E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3720/ 128728 | consumed samples: 59520 | consumed tokens: 121896960 | elapsed time per iteration (s): 15.21 | learning rate: 1.950E-05 | global batch size: 16 | lm loss: 5.468750E+00 
| grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3721/ 128728 | consumed samples: 59536 | consumed tokens: 121929728 | elapsed time per iteration (s): 15.21 | learning rate: 1.951E-05 | global batch size: 16 | lm loss: 5.492259E+00 | grad norm: 1.078 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3722/ 128728 | consumed samples: 59552 | consumed tokens: 121962496 | elapsed time per iteration (s): 15.24 | learning rate: 1.951E-05 | global batch size: 16 | lm loss: 5.689316E+00 | grad norm: 2.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3723/ 128728 | consumed samples: 59568 | consumed tokens: 121995264 | elapsed time per iteration (s): 15.20 | learning rate: 1.952E-05 | global batch size: 16 | lm loss: 5.625181E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3724/ 128728 | consumed samples: 59584 | consumed tokens: 122028032 | elapsed time per iteration (s): 15.23 | learning rate: 1.952E-05 | global batch size: 16 | lm loss: 5.407440E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3725/ 128728 | consumed samples: 59600 | consumed tokens: 122060800 | elapsed time per iteration (s): 15.24 | learning rate: 1.953E-05 | global batch size: 16 | lm loss: 5.463798E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3726/ 128728 | consumed samples: 59616 | consumed tokens: 122093568 | elapsed time 
per iteration (s): 15.18 | learning rate: 1.954E-05 | global batch size: 16 | lm loss: 5.562949E+00 | grad norm: 0.991 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3727/ 128728 | consumed samples: 59632 | consumed tokens: 122126336 | elapsed time per iteration (s): 15.23 | learning rate: 1.954E-05 | global batch size: 16 | lm loss: 5.715884E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3728/ 128728 | consumed samples: 59648 | consumed tokens: 122159104 | elapsed time per iteration (s): 15.23 | learning rate: 1.955E-05 | global batch size: 16 | lm loss: 5.560648E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3729/ 128728 | consumed samples: 59664 | consumed tokens: 122191872 | elapsed time per iteration (s): 15.21 | learning rate: 1.955E-05 | global batch size: 16 | lm loss: 5.648838E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3730/ 128728 | consumed samples: 59680 | consumed tokens: 122224640 | elapsed time per iteration (s): 15.23 | learning rate: 1.956E-05 | global batch size: 16 | lm loss: 5.286009E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3731/ 128728 | consumed samples: 59696 | consumed tokens: 122257408 | elapsed time per iteration (s): 15.24 | learning rate: 1.956E-05 | global batch size: 16 | lm loss: 5.616099E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 3732/ 128728 | consumed samples: 59712 | consumed tokens: 122290176 | elapsed time per iteration (s): 15.19 | learning rate: 1.957E-05 | global batch size: 16 | lm loss: 5.448312E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3733/ 128728 | consumed samples: 59728 | consumed tokens: 122322944 | elapsed time per iteration (s): 15.27 | learning rate: 1.957E-05 | global batch size: 16 | lm loss: 5.397096E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3734/ 128728 | consumed samples: 59744 | consumed tokens: 122355712 | elapsed time per iteration (s): 15.17 | learning rate: 1.958E-05 | global batch size: 16 | lm loss: 5.431047E+00 | grad norm: 1.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3735/ 128728 | consumed samples: 59760 | consumed tokens: 122388480 | elapsed time per iteration (s): 15.19 | learning rate: 1.958E-05 | global batch size: 16 | lm loss: 5.394694E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3736/ 128728 | consumed samples: 59776 | consumed tokens: 122421248 | elapsed time per iteration (s): 15.22 | learning rate: 1.959E-05 | global batch size: 16 | lm loss: 5.561465E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3737/ 128728 | consumed samples: 59792 | consumed tokens: 122454016 | elapsed time per iteration (s): 15.24 | learning rate: 1.959E-05 | global batch size: 16 | lm loss: 5.591651E+00 | grad norm: 0.794 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3738/ 128728 | consumed samples: 59808 | consumed tokens: 122486784 | elapsed time per iteration (s): 15.15 | learning rate: 1.960E-05 | global batch size: 16 | lm loss: 5.337072E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3739/ 128728 | consumed samples: 59824 | consumed tokens: 122519552 | elapsed time per iteration (s): 15.19 | learning rate: 1.960E-05 | global batch size: 16 | lm loss: 5.194335E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3740/ 128728 | consumed samples: 59840 | consumed tokens: 122552320 | elapsed time per iteration (s): 15.23 | learning rate: 1.961E-05 | global batch size: 16 | lm loss: 5.511204E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3741/ 128728 | consumed samples: 59856 | consumed tokens: 122585088 | elapsed time per iteration (s): 15.22 | learning rate: 1.961E-05 | global batch size: 16 | lm loss: 5.479012E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3742/ 128728 | consumed samples: 59872 | consumed tokens: 122617856 | elapsed time per iteration (s): 15.22 | learning rate: 1.962E-05 | global batch size: 16 | lm loss: 5.472262E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3743/ 128728 | consumed samples: 59888 | consumed tokens: 122650624 | elapsed time per iteration (s): 15.22 | learning rate: 
1.962E-05 | global batch size: 16 | lm loss: 5.323508E+00 | grad norm: 0.973 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3744/ 128728 | consumed samples: 59904 | consumed tokens: 122683392 | elapsed time per iteration (s): 15.24 | learning rate: 1.963E-05 | global batch size: 16 | lm loss: 5.851313E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3745/ 128728 | consumed samples: 59920 | consumed tokens: 122716160 | elapsed time per iteration (s): 15.25 | learning rate: 1.963E-05 | global batch size: 16 | lm loss: 5.437391E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3746/ 128728 | consumed samples: 59936 | consumed tokens: 122748928 | elapsed time per iteration (s): 15.20 | learning rate: 1.964E-05 | global batch size: 16 | lm loss: 5.474227E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3747/ 128728 | consumed samples: 59952 | consumed tokens: 122781696 | elapsed time per iteration (s): 15.18 | learning rate: 1.965E-05 | global batch size: 16 | lm loss: 5.767534E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3748/ 128728 | consumed samples: 59968 | consumed tokens: 122814464 | elapsed time per iteration (s): 15.23 | learning rate: 1.965E-05 | global batch size: 16 | lm loss: 5.451702E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3749/ 128728 | consumed 
samples: 59984 | consumed tokens: 122847232 | elapsed time per iteration (s): 15.24 | learning rate: 1.966E-05 | global batch size: 16 | lm loss: 5.430769E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3750/ 128728 | consumed samples: 60000 | consumed tokens: 122880000 | elapsed time per iteration (s): 15.23 | learning rate: 1.966E-05 | global batch size: 16 | lm loss: 5.410946E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3751/ 128728 | consumed samples: 60016 | consumed tokens: 122912768 | elapsed time per iteration (s): 15.21 | learning rate: 1.967E-05 | global batch size: 16 | lm loss: 5.695611E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3752/ 128728 | consumed samples: 60032 | consumed tokens: 122945536 | elapsed time per iteration (s): 15.21 | learning rate: 1.967E-05 | global batch size: 16 | lm loss: 5.481835E+00 | grad norm: 0.690 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3753/ 128728 | consumed samples: 60048 | consumed tokens: 122978304 | elapsed time per iteration (s): 15.23 | learning rate: 1.968E-05 | global batch size: 16 | lm loss: 5.562807E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3754/ 128728 | consumed samples: 60064 | consumed tokens: 123011072 | elapsed time per iteration (s): 15.25 | learning rate: 1.968E-05 | global batch size: 16 | lm loss: 5.459926E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3755/ 128728 | consumed samples: 60080 | consumed tokens: 123043840 | elapsed time per iteration (s): 15.24 | learning rate: 1.969E-05 | global batch size: 16 | lm loss: 5.412202E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3756/ 128728 | consumed samples: 60096 | consumed tokens: 123076608 | elapsed time per iteration (s): 15.19 | learning rate: 1.969E-05 | global batch size: 16 | lm loss: 5.587904E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3757/ 128728 | consumed samples: 60112 | consumed tokens: 123109376 | elapsed time per iteration (s): 15.23 | learning rate: 1.970E-05 | global batch size: 16 | lm loss: 5.509131E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3758/ 128728 | consumed samples: 60128 | consumed tokens: 123142144 | elapsed time per iteration (s): 15.23 | learning rate: 1.970E-05 | global batch size: 16 | lm loss: 5.465936E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3759/ 128728 | consumed samples: 60144 | consumed tokens: 123174912 | elapsed time per iteration (s): 15.23 | learning rate: 1.971E-05 | global batch size: 16 | lm loss: 5.438951E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3760/ 128728 | consumed samples: 60160 | consumed tokens: 123207680 | elapsed time per iteration (s): 15.24 | learning rate: 1.971E-05 | global batch size: 16 | lm 
loss: 5.498137E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3761/ 128728 | consumed samples: 60176 | consumed tokens: 123240448 | elapsed time per iteration (s): 15.24 | learning rate: 1.972E-05 | global batch size: 16 | lm loss: 5.450524E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3762/ 128728 | consumed samples: 60192 | consumed tokens: 123273216 | elapsed time per iteration (s): 15.24 | learning rate: 1.972E-05 | global batch size: 16 | lm loss: 5.642553E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3763/ 128728 | consumed samples: 60208 | consumed tokens: 123305984 | elapsed time per iteration (s): 15.26 | learning rate: 1.973E-05 | global batch size: 16 | lm loss: 5.109872E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3764/ 128728 | consumed samples: 60224 | consumed tokens: 123338752 | elapsed time per iteration (s): 15.22 | learning rate: 1.973E-05 | global batch size: 16 | lm loss: 5.403108E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3765/ 128728 | consumed samples: 60240 | consumed tokens: 123371520 | elapsed time per iteration (s): 15.17 | learning rate: 1.974E-05 | global batch size: 16 | lm loss: 5.276863E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3766/ 128728 | consumed samples: 60256 | consumed tokens: 
123404288 | elapsed time per iteration (s): 15.18 | learning rate: 1.974E-05 | global batch size: 16 | lm loss: 5.535725E+00 | grad norm: 1.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3767/ 128728 | consumed samples: 60272 | consumed tokens: 123437056 | elapsed time per iteration (s): 15.21 | learning rate: 1.975E-05 | global batch size: 16 | lm loss: 5.349348E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3768/ 128728 | consumed samples: 60288 | consumed tokens: 123469824 | elapsed time per iteration (s): 15.18 | learning rate: 1.976E-05 | global batch size: 16 | lm loss: 5.511031E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3769/ 128728 | consumed samples: 60304 | consumed tokens: 123502592 | elapsed time per iteration (s): 15.23 | learning rate: 1.976E-05 | global batch size: 16 | lm loss: 5.505275E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3770/ 128728 | consumed samples: 60320 | consumed tokens: 123535360 | elapsed time per iteration (s): 15.22 | learning rate: 1.977E-05 | global batch size: 16 | lm loss: 5.564199E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3771/ 128728 | consumed samples: 60336 | consumed tokens: 123568128 | elapsed time per iteration (s): 15.20 | learning rate: 1.977E-05 | global batch size: 16 | lm loss: 5.466618E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.053 | TFLOPs: 8.06 | [default7]: iteration 3772/ 128728 | consumed samples: 60352 | consumed tokens: 123600896 | elapsed time per iteration (s): 15.25 | learning rate: 1.978E-05 | global batch size: 16 | lm loss: 5.439900E+00 | grad norm: 2.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3773/ 128728 | consumed samples: 60368 | consumed tokens: 123633664 | elapsed time per iteration (s): 15.23 | learning rate: 1.978E-05 | global batch size: 16 | lm loss: 5.633871E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3774/ 128728 | consumed samples: 60384 | consumed tokens: 123666432 | elapsed time per iteration (s): 15.23 | learning rate: 1.979E-05 | global batch size: 16 | lm loss: 5.420053E+00 | grad norm: 1.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3775/ 128728 | consumed samples: 60400 | consumed tokens: 123699200 | elapsed time per iteration (s): 15.24 | learning rate: 1.979E-05 | global batch size: 16 | lm loss: 5.728425E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3776/ 128728 | consumed samples: 60416 | consumed tokens: 123731968 | elapsed time per iteration (s): 15.23 | learning rate: 1.980E-05 | global batch size: 16 | lm loss: 5.415603E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3777/ 128728 | consumed samples: 60432 | consumed tokens: 123764736 | elapsed time per iteration (s): 15.21 | learning rate: 1.980E-05 | global batch size: 16 | lm loss: 5.579483E+00 | grad norm: 1.002 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3778/ 128728 | consumed samples: 60448 | consumed tokens: 123797504 | elapsed time per iteration (s): 15.25 | learning rate: 1.981E-05 | global batch size: 16 | lm loss: 5.535165E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3779/ 128728 | consumed samples: 60464 | consumed tokens: 123830272 | elapsed time per iteration (s): 15.25 | learning rate: 1.981E-05 | global batch size: 16 | lm loss: 5.334061E+00 | grad norm: 0.883 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3780/ 128728 | consumed samples: 60480 | consumed tokens: 123863040 | elapsed time per iteration (s): 15.21 | learning rate: 1.982E-05 | global batch size: 16 | lm loss: 5.297910E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3781/ 128728 | consumed samples: 60496 | consumed tokens: 123895808 | elapsed time per iteration (s): 15.21 | learning rate: 1.982E-05 | global batch size: 16 | lm loss: 5.702374E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3782/ 128728 | consumed samples: 60512 | consumed tokens: 123928576 | elapsed time per iteration (s): 15.20 | learning rate: 1.983E-05 | global batch size: 16 | lm loss: 5.441247E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3783/ 128728 | consumed samples: 60528 | consumed tokens: 123961344 | elapsed time per iteration (s): 
15.22 | learning rate: 1.983E-05 | global batch size: 16 | lm loss: 5.571175E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3784/ 128728 | consumed samples: 60544 | consumed tokens: 123994112 | elapsed time per iteration (s): 15.22 | learning rate: 1.984E-05 | global batch size: 16 | lm loss: 5.425476E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3785/ 128728 | consumed samples: 60560 | consumed tokens: 124026880 | elapsed time per iteration (s): 15.19 | learning rate: 1.984E-05 | global batch size: 16 | lm loss: 5.498517E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3786/ 128728 | consumed samples: 60576 | consumed tokens: 124059648 | elapsed time per iteration (s): 15.23 | learning rate: 1.985E-05 | global batch size: 16 | lm loss: 5.395185E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3787/ 128728 | consumed samples: 60592 | consumed tokens: 124092416 | elapsed time per iteration (s): 15.22 | learning rate: 1.985E-05 | global batch size: 16 | lm loss: 5.543329E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3788/ 128728 | consumed samples: 60608 | consumed tokens: 124125184 | elapsed time per iteration (s): 15.21 | learning rate: 1.986E-05 | global batch size: 16 | lm loss: 5.629831E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 
3789/ 128728 | consumed samples: 60624 | consumed tokens: 124157952 | elapsed time per iteration (s): 15.21 | learning rate: 1.987E-05 | global batch size: 16 | lm loss: 5.335208E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3790/ 128728 | consumed samples: 60640 | consumed tokens: 124190720 | elapsed time per iteration (s): 15.20 | learning rate: 1.987E-05 | global batch size: 16 | lm loss: 5.583022E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3791/ 128728 | consumed samples: 60656 | consumed tokens: 124223488 | elapsed time per iteration (s): 15.21 | learning rate: 1.988E-05 | global batch size: 16 | lm loss: 5.452947E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3792/ 128728 | consumed samples: 60672 | consumed tokens: 124256256 | elapsed time per iteration (s): 15.25 | learning rate: 1.988E-05 | global batch size: 16 | lm loss: 5.283163E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3793/ 128728 | consumed samples: 60688 | consumed tokens: 124289024 | elapsed time per iteration (s): 15.23 | learning rate: 1.989E-05 | global batch size: 16 | lm loss: 5.357467E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3794/ 128728 | consumed samples: 60704 | consumed tokens: 124321792 | elapsed time per iteration (s): 15.21 | learning rate: 1.989E-05 | global batch size: 16 | lm loss: 5.540434E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3795/ 128728 | consumed samples: 60720 | consumed tokens: 124354560 | elapsed time per iteration (s): 15.23 | learning rate: 1.990E-05 | global batch size: 16 | lm loss: 5.701981E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3796/ 128728 | consumed samples: 60736 | consumed tokens: 124387328 | elapsed time per iteration (s): 15.22 | learning rate: 1.990E-05 | global batch size: 16 | lm loss: 5.400178E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3797/ 128728 | consumed samples: 60752 | consumed tokens: 124420096 | elapsed time per iteration (s): 15.16 | learning rate: 1.991E-05 | global batch size: 16 | lm loss: 5.381162E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3798/ 128728 | consumed samples: 60768 | consumed tokens: 124452864 | elapsed time per iteration (s): 15.21 | learning rate: 1.991E-05 | global batch size: 16 | lm loss: 5.394807E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3799/ 128728 | consumed samples: 60784 | consumed tokens: 124485632 | elapsed time per iteration (s): 15.17 | learning rate: 1.992E-05 | global batch size: 16 | lm loss: 5.298486E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3800/ 128728 | consumed samples: 60800 | consumed tokens: 124518400 | elapsed time per iteration (s): 15.24 | learning rate: 1.992E-05 | 
global batch size: 16 | lm loss: 5.496459E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3801/ 128728 | consumed samples: 60816 | consumed tokens: 124551168 | elapsed time per iteration (s): 15.18 | learning rate: 1.993E-05 | global batch size: 16 | lm loss: 5.387410E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3802/ 128728 | consumed samples: 60832 | consumed tokens: 124583936 | elapsed time per iteration (s): 15.23 | learning rate: 1.993E-05 | global batch size: 16 | lm loss: 5.404246E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3803/ 128728 | consumed samples: 60848 | consumed tokens: 124616704 | elapsed time per iteration (s): 15.22 | learning rate: 1.994E-05 | global batch size: 16 | lm loss: 5.481224E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3804/ 128728 | consumed samples: 60864 | consumed tokens: 124649472 | elapsed time per iteration (s): 15.27 | learning rate: 1.994E-05 | global batch size: 16 | lm loss: 5.301341E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3805/ 128728 | consumed samples: 60880 | consumed tokens: 124682240 | elapsed time per iteration (s): 15.24 | learning rate: 1.995E-05 | global batch size: 16 | lm loss: 5.260728E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3806/ 128728 | consumed samples: 
60896 | consumed tokens: 124715008 | elapsed time per iteration (s): 15.25 | learning rate: 1.995E-05 | global batch size: 16 | lm loss: 5.525875E+00 | grad norm: 1.345 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3807/ 128728 | consumed samples: 60912 | consumed tokens: 124747776 | elapsed time per iteration (s): 15.19 | learning rate: 1.996E-05 | global batch size: 16 | lm loss: 5.592893E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3808/ 128728 | consumed samples: 60928 | consumed tokens: 124780544 | elapsed time per iteration (s): 15.22 | learning rate: 1.996E-05 | global batch size: 16 | lm loss: 5.427948E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3809/ 128728 | consumed samples: 60944 | consumed tokens: 124813312 | elapsed time per iteration (s): 15.15 | learning rate: 1.997E-05 | global batch size: 16 | lm loss: 5.401147E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3810/ 128728 | consumed samples: 60960 | consumed tokens: 124846080 | elapsed time per iteration (s): 15.23 | learning rate: 1.998E-05 | global batch size: 16 | lm loss: 5.241078E+00 | grad norm: 1.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3811/ 128728 | consumed samples: 60976 | consumed tokens: 124878848 | elapsed time per iteration (s): 15.17 | learning rate: 1.998E-05 | global batch size: 16 | lm loss: 5.158630E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3812/ 128728 | consumed samples: 60992 | consumed tokens: 124911616 | elapsed time per iteration (s): 15.21 | learning rate: 1.999E-05 | global batch size: 16 | lm loss: 5.613994E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3813/ 128728 | consumed samples: 61008 | consumed tokens: 124944384 | elapsed time per iteration (s): 15.22 | learning rate: 1.999E-05 | global batch size: 16 | lm loss: 5.171216E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3814/ 128728 | consumed samples: 61024 | consumed tokens: 124977152 | elapsed time per iteration (s): 15.19 | learning rate: 2.000E-05 | global batch size: 16 | lm loss: 5.270428E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3815/ 128728 | consumed samples: 61040 | consumed tokens: 125009920 | elapsed time per iteration (s): 15.25 | learning rate: 2.000E-05 | global batch size: 16 | lm loss: 5.501937E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 3816/ 128728 | consumed samples: 61056 | consumed tokens: 125042688 | elapsed time per iteration (s): 15.23 | learning rate: 2.001E-05 | global batch size: 16 | lm loss: 5.503111E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3817/ 128728 | consumed samples: 61072 | consumed tokens: 125075456 | elapsed time per iteration (s): 15.22 | learning rate: 2.001E-05 | global batch size: 16 | lm loss: 5.680742E+00 
| grad norm: 0.965 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3818/ 128728 | consumed samples: 61088 | consumed tokens: 125108224 | elapsed time per iteration (s): 15.22 | learning rate: 2.002E-05 | global batch size: 16 | lm loss: 5.501068E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3819/ 128728 | consumed samples: 61104 | consumed tokens: 125140992 | elapsed time per iteration (s): 15.26 | learning rate: 2.002E-05 | global batch size: 16 | lm loss: 5.319207E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 3820/ 128728 | consumed samples: 61120 | consumed tokens: 125173760 | elapsed time per iteration (s): 15.21 | learning rate: 2.003E-05 | global batch size: 16 | lm loss: 5.308980E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3821/ 128728 | consumed samples: 61136 | consumed tokens: 125206528 | elapsed time per iteration (s): 15.20 | learning rate: 2.003E-05 | global batch size: 16 | lm loss: 5.577042E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3822/ 128728 | consumed samples: 61152 | consumed tokens: 125239296 | elapsed time per iteration (s): 15.22 | learning rate: 2.004E-05 | global batch size: 16 | lm loss: 5.287234E+00 | grad norm: 1.092 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3823/ 128728 | consumed samples: 61168 | consumed tokens: 125272064 | elapsed time 
per iteration (s): 15.17 | learning rate: 2.004E-05 | global batch size: 16 | lm loss: 5.414005E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3824/ 128728 | consumed samples: 61184 | consumed tokens: 125304832 | elapsed time per iteration (s): 15.26 | learning rate: 2.005E-05 | global batch size: 16 | lm loss: 5.606541E+00 | grad norm: 1.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3825/ 128728 | consumed samples: 61200 | consumed tokens: 125337600 | elapsed time per iteration (s): 15.23 | learning rate: 2.005E-05 | global batch size: 16 | lm loss: 5.391608E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3826/ 128728 | consumed samples: 61216 | consumed tokens: 125370368 | elapsed time per iteration (s): 15.24 | learning rate: 2.006E-05 | global batch size: 16 | lm loss: 5.659523E+00 | grad norm: 2.248 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3827/ 128728 | consumed samples: 61232 | consumed tokens: 125403136 | elapsed time per iteration (s): 15.22 | learning rate: 2.006E-05 | global batch size: 16 | lm loss: 5.057670E+00 | grad norm: 0.931 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3828/ 128728 | consumed samples: 61248 | consumed tokens: 125435904 | elapsed time per iteration (s): 15.23 | learning rate: 2.007E-05 | global batch size: 16 | lm loss: 5.481532E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 3829/ 128728 | consumed samples: 61264 | consumed tokens: 125468672 | elapsed time per iteration (s): 15.19 | learning rate: 2.008E-05 | global batch size: 16 | lm loss: 5.234412E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3830/ 128728 | consumed samples: 61280 | consumed tokens: 125501440 | elapsed time per iteration (s): 15.16 | learning rate: 2.008E-05 | global batch size: 16 | lm loss: 5.504411E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3831/ 128728 | consumed samples: 61296 | consumed tokens: 125534208 | elapsed time per iteration (s): 15.15 | learning rate: 2.009E-05 | global batch size: 16 | lm loss: 5.468637E+00 | grad norm: 0.931 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3832/ 128728 | consumed samples: 61312 | consumed tokens: 125566976 | elapsed time per iteration (s): 15.23 | learning rate: 2.009E-05 | global batch size: 16 | lm loss: 5.480287E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3833/ 128728 | consumed samples: 61328 | consumed tokens: 125599744 | elapsed time per iteration (s): 15.18 | learning rate: 2.010E-05 | global batch size: 16 | lm loss: 5.492439E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3834/ 128728 | consumed samples: 61344 | consumed tokens: 125632512 | elapsed time per iteration (s): 15.23 | learning rate: 2.010E-05 | global batch size: 16 | lm loss: 5.287287E+00 | grad norm: 0.970 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3835/ 128728 | consumed samples: 61360 | consumed tokens: 125665280 | elapsed time per iteration (s): 15.22 | learning rate: 2.011E-05 | global batch size: 16 | lm loss: 5.399631E+00 | grad norm: 1.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3836/ 128728 | consumed samples: 61376 | consumed tokens: 125698048 | elapsed time per iteration (s): 15.20 | learning rate: 2.011E-05 | global batch size: 16 | lm loss: 5.347549E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3837/ 128728 | consumed samples: 61392 | consumed tokens: 125730816 | elapsed time per iteration (s): 15.18 | learning rate: 2.012E-05 | global batch size: 16 | lm loss: 5.494516E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3838/ 128728 | consumed samples: 61408 | consumed tokens: 125763584 | elapsed time per iteration (s): 15.27 | learning rate: 2.012E-05 | global batch size: 16 | lm loss: 5.462282E+00 | grad norm: 1.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3839/ 128728 | consumed samples: 61424 | consumed tokens: 125796352 | elapsed time per iteration (s): 15.24 | learning rate: 2.013E-05 | global batch size: 16 | lm loss: 5.329695E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3840/ 128728 | consumed samples: 61440 | consumed tokens: 125829120 | elapsed time per iteration (s): 15.26 | learning rate: 
2.013E-05 | global batch size: 16 | lm loss: 5.455020E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3841/ 128728 | consumed samples: 61456 | consumed tokens: 125861888 | elapsed time per iteration (s): 15.25 | learning rate: 2.014E-05 | global batch size: 16 | lm loss: 5.388807E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 3842/ 128728 | consumed samples: 61472 | consumed tokens: 125894656 | elapsed time per iteration (s): 15.19 | learning rate: 2.014E-05 | global batch size: 16 | lm loss: 5.453071E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3843/ 128728 | consumed samples: 61488 | consumed tokens: 125927424 | elapsed time per iteration (s): 15.20 | learning rate: 2.015E-05 | global batch size: 16 | lm loss: 5.550716E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3844/ 128728 | consumed samples: 61504 | consumed tokens: 125960192 | elapsed time per iteration (s): 15.20 | learning rate: 2.015E-05 | global batch size: 16 | lm loss: 5.434635E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3845/ 128728 | consumed samples: 61520 | consumed tokens: 125992960 | elapsed time per iteration (s): 15.15 | learning rate: 2.016E-05 | global batch size: 16 | lm loss: 5.393171E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3846/ 128728 | consumed 
samples: 61536 | consumed tokens: 126025728 | elapsed time per iteration (s): 15.16 | learning rate: 2.016E-05 | global batch size: 16 | lm loss: 5.437396E+00 | grad norm: 1.099 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3847/ 128728 | consumed samples: 61552 | consumed tokens: 126058496 | elapsed time per iteration (s): 15.20 | learning rate: 2.017E-05 | global batch size: 16 | lm loss: 5.379783E+00 | grad norm: 1.075 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3848/ 128728 | consumed samples: 61568 | consumed tokens: 126091264 | elapsed time per iteration (s): 15.22 | learning rate: 2.017E-05 | global batch size: 16 | lm loss: 5.551754E+00 | grad norm: 0.645 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3849/ 128728 | consumed samples: 61584 | consumed tokens: 126124032 | elapsed time per iteration (s): 15.22 | learning rate: 2.018E-05 | global batch size: 16 | lm loss: 5.263428E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3850/ 128728 | consumed samples: 61600 | consumed tokens: 126156800 | elapsed time per iteration (s): 15.22 | learning rate: 2.019E-05 | global batch size: 16 | lm loss: 5.389133E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3851/ 128728 | consumed samples: 61616 | consumed tokens: 126189568 | elapsed time per iteration (s): 15.20 | learning rate: 2.019E-05 | global batch size: 16 | lm loss: 5.425191E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3852/ 128728 | consumed samples: 61632 | consumed tokens: 126222336 | elapsed time per iteration (s): 15.22 | learning rate: 2.020E-05 | global batch size: 16 | lm loss: 5.259414E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3853/ 128728 | consumed samples: 61648 | consumed tokens: 126255104 | elapsed time per iteration (s): 15.20 | learning rate: 2.020E-05 | global batch size: 16 | lm loss: 5.419950E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3854/ 128728 | consumed samples: 61664 | consumed tokens: 126287872 | elapsed time per iteration (s): 15.18 | learning rate: 2.021E-05 | global batch size: 16 | lm loss: 5.455901E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3855/ 128728 | consumed samples: 61680 | consumed tokens: 126320640 | elapsed time per iteration (s): 15.22 | learning rate: 2.021E-05 | global batch size: 16 | lm loss: 5.723430E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3856/ 128728 | consumed samples: 61696 | consumed tokens: 126353408 | elapsed time per iteration (s): 15.14 | learning rate: 2.022E-05 | global batch size: 16 | lm loss: 5.380040E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3857/ 128728 | consumed samples: 61712 | consumed tokens: 126386176 | elapsed time per iteration (s): 15.18 | learning rate: 2.022E-05 | global batch size: 16 | lm 
loss: 5.547056E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3858/ 128728 | consumed samples: 61728 | consumed tokens: 126418944 | elapsed time per iteration (s): 15.17 | learning rate: 2.023E-05 | global batch size: 16 | lm loss: 5.517189E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3859/ 128728 | consumed samples: 61744 | consumed tokens: 126451712 | elapsed time per iteration (s): 15.19 | learning rate: 2.023E-05 | global batch size: 16 | lm loss: 5.323791E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3860/ 128728 | consumed samples: 61760 | consumed tokens: 126484480 | elapsed time per iteration (s): 15.19 | learning rate: 2.024E-05 | global batch size: 16 | lm loss: 5.446847E+00 | grad norm: 1.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3861/ 128728 | consumed samples: 61776 | consumed tokens: 126517248 | elapsed time per iteration (s): 15.13 | learning rate: 2.024E-05 | global batch size: 16 | lm loss: 5.215536E+00 | grad norm: 0.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3862/ 128728 | consumed samples: 61792 | consumed tokens: 126550016 | elapsed time per iteration (s): 15.16 | learning rate: 2.025E-05 | global batch size: 16 | lm loss: 5.761042E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3863/ 128728 | consumed samples: 61808 | consumed tokens: 
126582784 | elapsed time per iteration (s): 15.22 | learning rate: 2.025E-05 | global batch size: 16 | lm loss: 5.237271E+00 | grad norm: 0.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3864/ 128728 | consumed samples: 61824 | consumed tokens: 126615552 | elapsed time per iteration (s): 15.23 | learning rate: 2.026E-05 | global batch size: 16 | lm loss: 5.645336E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3865/ 128728 | consumed samples: 61840 | consumed tokens: 126648320 | elapsed time per iteration (s): 15.25 | learning rate: 2.026E-05 | global batch size: 16 | lm loss: 5.387892E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3866/ 128728 | consumed samples: 61856 | consumed tokens: 126681088 | elapsed time per iteration (s): 15.24 | learning rate: 2.027E-05 | global batch size: 16 | lm loss: 5.583735E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3867/ 128728 | consumed samples: 61872 | consumed tokens: 126713856 | elapsed time per iteration (s): 15.21 | learning rate: 2.027E-05 | global batch size: 16 | lm loss: 5.244458E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3868/ 128728 | consumed samples: 61888 | consumed tokens: 126746624 | elapsed time per iteration (s): 15.24 | learning rate: 2.028E-05 | global batch size: 16 | lm loss: 5.565816E+00 | grad norm: 0.716 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.050 | TFLOPs: 8.04 | [default7]: iteration 3869/ 128728 | consumed samples: 61904 | consumed tokens: 126779392 | elapsed time per iteration (s): 15.19 | learning rate: 2.028E-05 | global batch size: 16 | lm loss: 5.393667E+00 | grad norm: 1.035 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3870/ 128728 | consumed samples: 61920 | consumed tokens: 126812160 | elapsed time per iteration (s): 15.17 | learning rate: 2.029E-05 | global batch size: 16 | lm loss: 5.407505E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3871/ 128728 | consumed samples: 61936 | consumed tokens: 126844928 | elapsed time per iteration (s): 15.22 | learning rate: 2.030E-05 | global batch size: 16 | lm loss: 5.168123E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3872/ 128728 | consumed samples: 61952 | consumed tokens: 126877696 | elapsed time per iteration (s): 15.21 | learning rate: 2.030E-05 | global batch size: 16 | lm loss: 5.608961E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3873/ 128728 | consumed samples: 61968 | consumed tokens: 126910464 | elapsed time per iteration (s): 15.17 | learning rate: 2.031E-05 | global batch size: 16 | lm loss: 5.526161E+00 | grad norm: 1.332 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3874/ 128728 | consumed samples: 61984 | consumed tokens: 126943232 | elapsed time per iteration (s): 15.21 | learning rate: 2.031E-05 | global batch size: 16 | lm loss: 5.512238E+00 | grad norm: 2.111 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3875/ 128728 | consumed samples: 62000 | consumed tokens: 126976000 | elapsed time per iteration (s): 15.24 | learning rate: 2.032E-05 | global batch size: 16 | lm loss: 5.310292E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3876/ 128728 | consumed samples: 62016 | consumed tokens: 127008768 | elapsed time per iteration (s): 15.22 | learning rate: 2.032E-05 | global batch size: 16 | lm loss: 5.546309E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3877/ 128728 | consumed samples: 62032 | consumed tokens: 127041536 | elapsed time per iteration (s): 15.24 | learning rate: 2.033E-05 | global batch size: 16 | lm loss: 5.386329E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3878/ 128728 | consumed samples: 62048 | consumed tokens: 127074304 | elapsed time per iteration (s): 15.21 | learning rate: 2.033E-05 | global batch size: 16 | lm loss: 5.407649E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3879/ 128728 | consumed samples: 62064 | consumed tokens: 127107072 | elapsed time per iteration (s): 15.23 | learning rate: 2.034E-05 | global batch size: 16 | lm loss: 5.325084E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3880/ 128728 | consumed samples: 62080 | consumed tokens: 127139840 | elapsed time per iteration (s): 
15.25 | learning rate: 2.034E-05 | global batch size: 16 | lm loss: 5.383338E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3881/ 128728 | consumed samples: 62096 | consumed tokens: 127172608 | elapsed time per iteration (s): 15.19 | learning rate: 2.035E-05 | global batch size: 16 | lm loss: 5.435583E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3882/ 128728 | consumed samples: 62112 | consumed tokens: 127205376 | elapsed time per iteration (s): 15.22 | learning rate: 2.035E-05 | global batch size: 16 | lm loss: 5.391198E+00 | grad norm: 1.092 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3883/ 128728 | consumed samples: 62128 | consumed tokens: 127238144 | elapsed time per iteration (s): 15.24 | learning rate: 2.036E-05 | global batch size: 16 | lm loss: 5.385926E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3884/ 128728 | consumed samples: 62144 | consumed tokens: 127270912 | elapsed time per iteration (s): 15.22 | learning rate: 2.036E-05 | global batch size: 16 | lm loss: 5.435524E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3885/ 128728 | consumed samples: 62160 | consumed tokens: 127303680 | elapsed time per iteration (s): 15.19 | learning rate: 2.037E-05 | global batch size: 16 | lm loss: 5.325030E+00 | grad norm: 1.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 
3886/ 128728 | consumed samples: 62176 | consumed tokens: 127336448 | elapsed time per iteration (s): 15.21 | learning rate: 2.037E-05 | global batch size: 16 | lm loss: 5.474463E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3887/ 128728 | consumed samples: 62192 | consumed tokens: 127369216 | elapsed time per iteration (s): 15.21 | learning rate: 2.038E-05 | global batch size: 16 | lm loss: 5.445851E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3888/ 128728 | consumed samples: 62208 | consumed tokens: 127401984 | elapsed time per iteration (s): 15.22 | learning rate: 2.038E-05 | global batch size: 16 | lm loss: 5.609439E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3889/ 128728 | consumed samples: 62224 | consumed tokens: 127434752 | elapsed time per iteration (s): 15.23 | learning rate: 2.039E-05 | global batch size: 16 | lm loss: 5.400331E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3890/ 128728 | consumed samples: 62240 | consumed tokens: 127467520 | elapsed time per iteration (s): 15.23 | learning rate: 2.039E-05 | global batch size: 16 | lm loss: 5.481973E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3891/ 128728 | consumed samples: 62256 | consumed tokens: 127500288 | elapsed time per iteration (s): 15.18 | learning rate: 2.040E-05 | global batch size: 16 | lm loss: 5.350195E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3892/ 128728 | consumed samples: 62272 | consumed tokens: 127533056 | elapsed time per iteration (s): 15.27 | learning rate: 2.041E-05 | global batch size: 16 | lm loss: 5.500158E+00 | grad norm: 2.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 3893/ 128728 | consumed samples: 62288 | consumed tokens: 127565824 | elapsed time per iteration (s): 15.24 | learning rate: 2.041E-05 | global batch size: 16 | lm loss: 5.532178E+00 | grad norm: 0.834 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3894/ 128728 | consumed samples: 62304 | consumed tokens: 127598592 | elapsed time per iteration (s): 15.25 | learning rate: 2.042E-05 | global batch size: 16 | lm loss: 5.326356E+00 | grad norm: 1.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3895/ 128728 | consumed samples: 62320 | consumed tokens: 127631360 | elapsed time per iteration (s): 15.22 | learning rate: 2.042E-05 | global batch size: 16 | lm loss: 5.424766E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3896/ 128728 | consumed samples: 62336 | consumed tokens: 127664128 | elapsed time per iteration (s): 15.22 | learning rate: 2.043E-05 | global batch size: 16 | lm loss: 5.275890E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3897/ 128728 | consumed samples: 62352 | consumed tokens: 127696896 | elapsed time per iteration (s): 15.23 | learning rate: 2.043E-05 | 
global batch size: 16 | lm loss: 5.232322E+00 | grad norm: 1.560 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3898/ 128728 | consumed samples: 62368 | consumed tokens: 127729664 | elapsed time per iteration (s): 15.19 | learning rate: 2.044E-05 | global batch size: 16 | lm loss: 5.657388E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3899/ 128728 | consumed samples: 62384 | consumed tokens: 127762432 | elapsed time per iteration (s): 15.20 | learning rate: 2.044E-05 | global batch size: 16 | lm loss: 5.394963E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3900/ 128728 | consumed samples: 62400 | consumed tokens: 127795200 | elapsed time per iteration (s): 15.25 | learning rate: 2.045E-05 | global batch size: 16 | lm loss: 5.370610E+00 | grad norm: 1.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3901/ 128728 | consumed samples: 62416 | consumed tokens: 127827968 | elapsed time per iteration (s): 15.24 | learning rate: 2.045E-05 | global batch size: 16 | lm loss: 5.365441E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3902/ 128728 | consumed samples: 62432 | consumed tokens: 127860736 | elapsed time per iteration (s): 15.24 | learning rate: 2.046E-05 | global batch size: 16 | lm loss: 5.406076E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3903/ 128728 | consumed samples: 
62448 | consumed tokens: 127893504 | elapsed time per iteration (s): 15.25 | learning rate: 2.046E-05 | global batch size: 16 | lm loss: 5.409226E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3904/ 128728 | consumed samples: 62464 | consumed tokens: 127926272 | elapsed time per iteration (s): 15.30 | learning rate: 2.047E-05 | global batch size: 16 | lm loss: 5.347217E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.046 | TFLOPs: 8.01 | [default7]: iteration 3905/ 128728 | consumed samples: 62480 | consumed tokens: 127959040 | elapsed time per iteration (s): 15.21 | learning rate: 2.047E-05 | global batch size: 16 | lm loss: 5.564732E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3906/ 128728 | consumed samples: 62496 | consumed tokens: 127991808 | elapsed time per iteration (s): 15.21 | learning rate: 2.048E-05 | global batch size: 16 | lm loss: 5.865915E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3907/ 128728 | consumed samples: 62512 | consumed tokens: 128024576 | elapsed time per iteration (s): 15.19 | learning rate: 2.048E-05 | global batch size: 16 | lm loss: 4.977544E+00 | grad norm: 0.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3908/ 128728 | consumed samples: 62528 | consumed tokens: 128057344 | elapsed time per iteration (s): 15.19 | learning rate: 2.049E-05 | global batch size: 16 | lm loss: 5.338539E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3909/ 128728 | consumed samples: 62544 | consumed tokens: 128090112 | elapsed time per iteration (s): 15.25 | learning rate: 2.049E-05 | global batch size: 16 | lm loss: 5.439590E+00 | grad norm: 1.099 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3910/ 128728 | consumed samples: 62560 | consumed tokens: 128122880 | elapsed time per iteration (s): 15.26 | learning rate: 2.050E-05 | global batch size: 16 | lm loss: 5.531397E+00 | grad norm: 1.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3911/ 128728 | consumed samples: 62576 | consumed tokens: 128155648 | elapsed time per iteration (s): 15.25 | learning rate: 2.050E-05 | global batch size: 16 | lm loss: 5.485893E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3912/ 128728 | consumed samples: 62592 | consumed tokens: 128188416 | elapsed time per iteration (s): 15.23 | learning rate: 2.051E-05 | global batch size: 16 | lm loss: 5.491755E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3913/ 128728 | consumed samples: 62608 | consumed tokens: 128221184 | elapsed time per iteration (s): 15.23 | learning rate: 2.052E-05 | global batch size: 16 | lm loss: 5.505841E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3914/ 128728 | consumed samples: 62624 | consumed tokens: 128253952 | elapsed time per iteration (s): 15.24 | learning rate: 2.052E-05 | global batch size: 16 | lm loss: 5.293841E+00 
| grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3915/ 128728 | consumed samples: 62640 | consumed tokens: 128286720 | elapsed time per iteration (s): 15.23 | learning rate: 2.053E-05 | global batch size: 16 | lm loss: 5.651334E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3916/ 128728 | consumed samples: 62656 | consumed tokens: 128319488 | elapsed time per iteration (s): 15.24 | learning rate: 2.053E-05 | global batch size: 16 | lm loss: 5.581880E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3917/ 128728 | consumed samples: 62672 | consumed tokens: 128352256 | elapsed time per iteration (s): 15.22 | learning rate: 2.054E-05 | global batch size: 16 | lm loss: 5.295708E+00 | grad norm: 0.831 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3918/ 128728 | consumed samples: 62688 | consumed tokens: 128385024 | elapsed time per iteration (s): 15.21 | learning rate: 2.054E-05 | global batch size: 16 | lm loss: 5.468671E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3919/ 128728 | consumed samples: 62704 | consumed tokens: 128417792 | elapsed time per iteration (s): 15.23 | learning rate: 2.055E-05 | global batch size: 16 | lm loss: 5.449612E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3920/ 128728 | consumed samples: 62720 | consumed tokens: 128450560 | elapsed time 
per iteration (s): 15.18 | learning rate: 2.055E-05 | global batch size: 16 | lm loss: 5.470665E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3921/ 128728 | consumed samples: 62736 | consumed tokens: 128483328 | elapsed time per iteration (s): 15.17 | learning rate: 2.056E-05 | global batch size: 16 | lm loss: 5.540703E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3922/ 128728 | consumed samples: 62752 | consumed tokens: 128516096 | elapsed time per iteration (s): 15.17 | learning rate: 2.056E-05 | global batch size: 16 | lm loss: 5.231455E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3923/ 128728 | consumed samples: 62768 | consumed tokens: 128548864 | elapsed time per iteration (s): 15.21 | learning rate: 2.057E-05 | global batch size: 16 | lm loss: 5.513610E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3924/ 128728 | consumed samples: 62784 | consumed tokens: 128581632 | elapsed time per iteration (s): 15.19 | learning rate: 2.057E-05 | global batch size: 16 | lm loss: 5.542394E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3925/ 128728 | consumed samples: 62800 | consumed tokens: 128614400 | elapsed time per iteration (s): 15.21 | learning rate: 2.058E-05 | global batch size: 16 | lm loss: 5.609309E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | 
[default7]: iteration 3926/ 128728 | consumed samples: 62816 | consumed tokens: 128647168 | elapsed time per iteration (s): 15.19 | learning rate: 2.058E-05 | global batch size: 16 | lm loss: 5.394788E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3927/ 128728 | consumed samples: 62832 | consumed tokens: 128679936 | elapsed time per iteration (s): 15.23 | learning rate: 2.059E-05 | global batch size: 16 | lm loss: 5.177278E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3928/ 128728 | consumed samples: 62848 | consumed tokens: 128712704 | elapsed time per iteration (s): 15.19 | learning rate: 2.059E-05 | global batch size: 16 | lm loss: 5.202007E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3929/ 128728 | consumed samples: 62864 | consumed tokens: 128745472 | elapsed time per iteration (s): 15.19 | learning rate: 2.060E-05 | global batch size: 16 | lm loss: 5.402996E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 3930/ 128728 | consumed samples: 62880 | consumed tokens: 128778240 | elapsed time per iteration (s): 15.17 | learning rate: 2.060E-05 | global batch size: 16 | lm loss: 5.260327E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3931/ 128728 | consumed samples: 62896 | consumed tokens: 128811008 | elapsed time per iteration (s): 15.22 | learning rate: 2.061E-05 | global batch size: 16 | lm loss: 5.570568E+00 | grad norm: 3.273 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3932/ 128728 | consumed samples: 62912 | consumed tokens: 128843776 | elapsed time per iteration (s): 15.21 | learning rate: 2.062E-05 | global batch size: 16 | lm loss: 5.326445E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3933/ 128728 | consumed samples: 62928 | consumed tokens: 128876544 | elapsed time per iteration (s): 15.21 | learning rate: 2.062E-05 | global batch size: 16 | lm loss: 5.454953E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3934/ 128728 | consumed samples: 62944 | consumed tokens: 128909312 | elapsed time per iteration (s): 15.22 | learning rate: 2.063E-05 | global batch size: 16 | lm loss: 5.214005E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3935/ 128728 | consumed samples: 62960 | consumed tokens: 128942080 | elapsed time per iteration (s): 15.21 | learning rate: 2.063E-05 | global batch size: 16 | lm loss: 5.422129E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3936/ 128728 | consumed samples: 62976 | consumed tokens: 128974848 | elapsed time per iteration (s): 15.23 | learning rate: 2.064E-05 | global batch size: 16 | lm loss: 5.433270E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3937/ 128728 | consumed samples: 62992 | consumed tokens: 129007616 | elapsed time per iteration (s): 15.20 | learning rate: 
2.064E-05 | global batch size: 16 | lm loss: 5.334735E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3938/ 128728 | consumed samples: 63008 | consumed tokens: 129040384 | elapsed time per iteration (s): 15.19 | learning rate: 2.065E-05 | global batch size: 16 | lm loss: 5.458637E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3939/ 128728 | consumed samples: 63024 | consumed tokens: 129073152 | elapsed time per iteration (s): 15.25 | learning rate: 2.065E-05 | global batch size: 16 | lm loss: 5.288385E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3940/ 128728 | consumed samples: 63040 | consumed tokens: 129105920 | elapsed time per iteration (s): 15.21 | learning rate: 2.066E-05 | global batch size: 16 | lm loss: 5.444874E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3941/ 128728 | consumed samples: 63056 | consumed tokens: 129138688 | elapsed time per iteration (s): 15.21 | learning rate: 2.066E-05 | global batch size: 16 | lm loss: 5.580392E+00 | grad norm: 0.702 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3942/ 128728 | consumed samples: 63072 | consumed tokens: 129171456 | elapsed time per iteration (s): 15.18 | learning rate: 2.067E-05 | global batch size: 16 | lm loss: 5.633109E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3943/ 128728 | consumed 
samples: 63088 | consumed tokens: 129204224 | elapsed time per iteration (s): 15.21 | learning rate: 2.067E-05 | global batch size: 16 | lm loss: 5.486689E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3944/ 128728 | consumed samples: 63104 | consumed tokens: 129236992 | elapsed time per iteration (s): 15.21 | learning rate: 2.068E-05 | global batch size: 16 | lm loss: 5.653194E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3945/ 128728 | consumed samples: 63120 | consumed tokens: 129269760 | elapsed time per iteration (s): 15.14 | learning rate: 2.068E-05 | global batch size: 16 | lm loss: 5.570617E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3946/ 128728 | consumed samples: 63136 | consumed tokens: 129302528 | elapsed time per iteration (s): 15.23 | learning rate: 2.069E-05 | global batch size: 16 | lm loss: 5.431407E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3947/ 128728 | consumed samples: 63152 | consumed tokens: 129335296 | elapsed time per iteration (s): 15.23 | learning rate: 2.069E-05 | global batch size: 16 | lm loss: 5.536205E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3948/ 128728 | consumed samples: 63168 | consumed tokens: 129368064 | elapsed time per iteration (s): 15.14 | learning rate: 2.070E-05 | global batch size: 16 | lm loss: 5.436441E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3949/ 128728 | consumed samples: 63184 | consumed tokens: 129400832 | elapsed time per iteration (s): 15.24 | learning rate: 2.070E-05 | global batch size: 16 | lm loss: 5.337091E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3950/ 128728 | consumed samples: 63200 | consumed tokens: 129433600 | elapsed time per iteration (s): 15.23 | learning rate: 2.071E-05 | global batch size: 16 | lm loss: 5.656445E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3951/ 128728 | consumed samples: 63216 | consumed tokens: 129466368 | elapsed time per iteration (s): 15.23 | learning rate: 2.071E-05 | global batch size: 16 | lm loss: 5.297698E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3952/ 128728 | consumed samples: 63232 | consumed tokens: 129499136 | elapsed time per iteration (s): 15.18 | learning rate: 2.072E-05 | global batch size: 16 | lm loss: 5.657709E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3953/ 128728 | consumed samples: 63248 | consumed tokens: 129531904 | elapsed time per iteration (s): 15.19 | learning rate: 2.073E-05 | global batch size: 16 | lm loss: 5.463843E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3954/ 128728 | consumed samples: 63264 | consumed tokens: 129564672 | elapsed time per iteration (s): 15.20 | learning rate: 2.073E-05 | global batch size: 16 | lm 
loss: 5.530795E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3955/ 128728 | consumed samples: 63280 | consumed tokens: 129597440 | elapsed time per iteration (s): 15.20 | learning rate: 2.074E-05 | global batch size: 16 | lm loss: 5.277174E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3956/ 128728 | consumed samples: 63296 | consumed tokens: 129630208 | elapsed time per iteration (s): 15.19 | learning rate: 2.074E-05 | global batch size: 16 | lm loss: 5.323586E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3957/ 128728 | consumed samples: 63312 | consumed tokens: 129662976 | elapsed time per iteration (s): 15.22 | learning rate: 2.075E-05 | global batch size: 16 | lm loss: 5.472128E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3958/ 128728 | consumed samples: 63328 | consumed tokens: 129695744 | elapsed time per iteration (s): 15.20 | learning rate: 2.075E-05 | global batch size: 16 | lm loss: 5.385518E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3959/ 128728 | consumed samples: 63344 | consumed tokens: 129728512 | elapsed time per iteration (s): 15.23 | learning rate: 2.076E-05 | global batch size: 16 | lm loss: 5.426952E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3960/ 128728 | consumed samples: 63360 | consumed tokens: 
129761280 | elapsed time per iteration (s): 15.22 | learning rate: 2.076E-05 | global batch size: 16 | lm loss: 5.452140E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3961/ 128728 | consumed samples: 63376 | consumed tokens: 129794048 | elapsed time per iteration (s): 15.23 | learning rate: 2.077E-05 | global batch size: 16 | lm loss: 5.372558E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3962/ 128728 | consumed samples: 63392 | consumed tokens: 129826816 | elapsed time per iteration (s): 15.25 | learning rate: 2.077E-05 | global batch size: 16 | lm loss: 5.433863E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3963/ 128728 | consumed samples: 63408 | consumed tokens: 129859584 | elapsed time per iteration (s): 15.20 | learning rate: 2.078E-05 | global batch size: 16 | lm loss: 5.048560E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3964/ 128728 | consumed samples: 63424 | consumed tokens: 129892352 | elapsed time per iteration (s): 15.14 | learning rate: 2.078E-05 | global batch size: 16 | lm loss: 5.615795E+00 | grad norm: 0.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 3965/ 128728 | consumed samples: 63440 | consumed tokens: 129925120 | elapsed time per iteration (s): 15.22 | learning rate: 2.079E-05 | global batch size: 16 | lm loss: 5.405645E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.051 | TFLOPs: 8.05 | [default7]: iteration 3966/ 128728 | consumed samples: 63456 | consumed tokens: 129957888 | elapsed time per iteration (s): 15.20 | learning rate: 2.079E-05 | global batch size: 16 | lm loss: 5.304680E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3967/ 128728 | consumed samples: 63472 | consumed tokens: 129990656 | elapsed time per iteration (s): 15.22 | learning rate: 2.080E-05 | global batch size: 16 | lm loss: 5.625960E+00 | grad norm: 1.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3968/ 128728 | consumed samples: 63488 | consumed tokens: 130023424 | elapsed time per iteration (s): 15.21 | learning rate: 2.080E-05 | global batch size: 16 | lm loss: 5.581823E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3969/ 128728 | consumed samples: 63504 | consumed tokens: 130056192 | elapsed time per iteration (s): 15.18 | learning rate: 2.081E-05 | global batch size: 16 | lm loss: 5.444682E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3970/ 128728 | consumed samples: 63520 | consumed tokens: 130088960 | elapsed time per iteration (s): 15.21 | learning rate: 2.081E-05 | global batch size: 16 | lm loss: 5.335429E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3971/ 128728 | consumed samples: 63536 | consumed tokens: 130121728 | elapsed time per iteration (s): 15.19 | learning rate: 2.082E-05 | global batch size: 16 | lm loss: 5.558789E+00 | grad norm: 0.680 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3972/ 128728 | consumed samples: 63552 | consumed tokens: 130154496 | elapsed time per iteration (s): 15.16 | learning rate: 2.082E-05 | global batch size: 16 | lm loss: 5.333210E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 3973/ 128728 | consumed samples: 63568 | consumed tokens: 130187264 | elapsed time per iteration (s): 15.20 | learning rate: 2.083E-05 | global batch size: 16 | lm loss: 5.441347E+00 | grad norm: 0.875 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3974/ 128728 | consumed samples: 63584 | consumed tokens: 130220032 | elapsed time per iteration (s): 15.15 | learning rate: 2.084E-05 | global batch size: 16 | lm loss: 5.388178E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 3975/ 128728 | consumed samples: 63600 | consumed tokens: 130252800 | elapsed time per iteration (s): 15.18 | learning rate: 2.084E-05 | global batch size: 16 | lm loss: 5.478914E+00 | grad norm: 0.806 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3976/ 128728 | consumed samples: 63616 | consumed tokens: 130285568 | elapsed time per iteration (s): 15.22 | learning rate: 2.085E-05 | global batch size: 16 | lm loss: 5.390545E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3977/ 128728 | consumed samples: 63632 | consumed tokens: 130318336 | elapsed time per iteration (s): 
15.22 | learning rate: 2.085E-05 | global batch size: 16 | lm loss: 5.489986E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3978/ 128728 | consumed samples: 63648 | consumed tokens: 130351104 | elapsed time per iteration (s): 15.20 | learning rate: 2.086E-05 | global batch size: 16 | lm loss: 5.220353E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3979/ 128728 | consumed samples: 63664 | consumed tokens: 130383872 | elapsed time per iteration (s): 15.22 | learning rate: 2.086E-05 | global batch size: 16 | lm loss: 5.544164E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3980/ 128728 | consumed samples: 63680 | consumed tokens: 130416640 | elapsed time per iteration (s): 15.21 | learning rate: 2.087E-05 | global batch size: 16 | lm loss: 5.339544E+00 | grad norm: 1.046 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3981/ 128728 | consumed samples: 63696 | consumed tokens: 130449408 | elapsed time per iteration (s): 15.24 | learning rate: 2.087E-05 | global batch size: 16 | lm loss: 5.444860E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3982/ 128728 | consumed samples: 63712 | consumed tokens: 130482176 | elapsed time per iteration (s): 15.20 | learning rate: 2.088E-05 | global batch size: 16 | lm loss: 5.265263E+00 | grad norm: 1.516 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 
3983/ 128728 | consumed samples: 63728 | consumed tokens: 130514944 | elapsed time per iteration (s): 15.25 | learning rate: 2.088E-05 | global batch size: 16 | lm loss: 5.278424E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3984/ 128728 | consumed samples: 63744 | consumed tokens: 130547712 | elapsed time per iteration (s): 15.21 | learning rate: 2.089E-05 | global batch size: 16 | lm loss: 5.488007E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3985/ 128728 | consumed samples: 63760 | consumed tokens: 130580480 | elapsed time per iteration (s): 15.23 | learning rate: 2.089E-05 | global batch size: 16 | lm loss: 5.426978E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 3986/ 128728 | consumed samples: 63776 | consumed tokens: 130613248 | elapsed time per iteration (s): 15.25 | learning rate: 2.090E-05 | global batch size: 16 | lm loss: 5.588451E+00 | grad norm: 1.938 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3987/ 128728 | consumed samples: 63792 | consumed tokens: 130646016 | elapsed time per iteration (s): 15.26 | learning rate: 2.090E-05 | global batch size: 16 | lm loss: 5.366596E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 3988/ 128728 | consumed samples: 63808 | consumed tokens: 130678784 | elapsed time per iteration (s): 15.17 | learning rate: 2.091E-05 | global batch size: 16 | lm loss: 5.419157E+00 | grad norm: 0.912 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 3989/ 128728 | consumed samples: 63824 | consumed tokens: 130711552 | elapsed time per iteration (s): 15.20 | learning rate: 2.091E-05 | global batch size: 16 | lm loss: 5.551931E+00 | grad norm: 1.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3990/ 128728 | consumed samples: 63840 | consumed tokens: 130744320 | elapsed time per iteration (s): 15.24 | learning rate: 2.092E-05 | global batch size: 16 | lm loss: 5.222095E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3991/ 128728 | consumed samples: 63856 | consumed tokens: 130777088 | elapsed time per iteration (s): 15.21 | learning rate: 2.092E-05 | global batch size: 16 | lm loss: 5.331915E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 3992/ 128728 | consumed samples: 63872 | consumed tokens: 130809856 | elapsed time per iteration (s): 15.17 | learning rate: 2.093E-05 | global batch size: 16 | lm loss: 5.306742E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 3993/ 128728 | consumed samples: 63888 | consumed tokens: 130842624 | elapsed time per iteration (s): 15.23 | learning rate: 2.093E-05 | global batch size: 16 | lm loss: 5.580595E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 3994/ 128728 | consumed samples: 63904 | consumed tokens: 130875392 | elapsed time per iteration (s): 15.20 | learning rate: 2.094E-05 | 
global batch size: 16 | lm loss: 5.409997E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3995/ 128728 | consumed samples: 63920 | consumed tokens: 130908160 | elapsed time per iteration (s): 15.21 | learning rate: 2.095E-05 | global batch size: 16 | lm loss: 5.411019E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 3996/ 128728 | consumed samples: 63936 | consumed tokens: 130940928 | elapsed time per iteration (s): 15.20 | learning rate: 2.095E-05 | global batch size: 16 | lm loss: 5.389151E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 3997/ 128728 | consumed samples: 63952 | consumed tokens: 130973696 | elapsed time per iteration (s): 15.18 | learning rate: 2.096E-05 | global batch size: 16 | lm loss: 5.458337E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 3998/ 128728 | consumed samples: 63968 | consumed tokens: 131006464 | elapsed time per iteration (s): 15.23 | learning rate: 2.096E-05 | global batch size: 16 | lm loss: 5.207241E+00 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 3999/ 128728 | consumed samples: 63984 | consumed tokens: 131039232 | elapsed time per iteration (s): 15.21 | learning rate: 2.097E-05 | global batch size: 16 | lm loss: 5.415626E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4000/ 128728 | consumed samples: 
64000 | consumed tokens: 131072000 | elapsed time per iteration (s): 15.25 | learning rate: 2.097E-05 | global batch size: 16 | lm loss: 5.328699E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default0]:[2022-03-03 22:54:56,238] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=0, lr=[2.0971573687228642e-05, 2.0971573687228642e-05, 2.0971573687228642e-05], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)] [default0]:steps: 4000 loss: 5.3287 iter time (s): 14.265 samples/sec: 1.122 [default7]:------------------------------------------------------------------------------------------ [default7]:valid loss at iteration 4000 | lm loss value: 5.666330E+00 | lm loss PPL: 2.889720E+02 | [default7]:------------------------------------------------------------------------------------------ [default0]:saving checkpoint at iteration 4000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[2022-03-03 22:55:25,744] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/mp_rank_00_model_states.pt [default1]:[2022-03-03 22:55:25,763] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/mp_rank_01_model_states.pt [default7]:[2022-03-03 22:55:38,931] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default4]:[2022-03-03 22:55:39,465] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default7]:[2022-03-03 22:55:39,732] [INFO] [engine.py:3077:_save_zero_checkpoint] 
bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default2]:[2022-03-03 22:55:39,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default5]:[2022-03-03 22:55:39,888] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default5]:[2022-03-03 22:55:39,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default0]:[2022-03-03 22:55:39,987] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default3]:[2022-03-03 22:55:40,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default6]:[2022-03-03 22:55:40,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default3]:[2022-03-03 22:55:40,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default2]:[2022-03-03 22:55:40,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default1]:[2022-03-03 22:55:40,217] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default0]:[2022-03-03 22:55:40,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default0]:[2022-03-03 22:55:40,303] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default7]:[2022-03-03 22:55:40,396] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default6]:[2022-03-03 22:55:40,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default1]:[2022-03-03 22:55:40,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default6]:[2022-03-03 22:55:40,462] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default4]:[2022-03-03 22:55:40,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default5]:[2022-03-03 22:55:40,696] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default3]:[2022-03-03 22:55:41,026] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default5]:[2022-03-03 22:55:40,986] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default0]:[2022-03-03 22:55:41,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default4]:[2022-03-03 22:55:41,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default1]:[2022-03-03 22:55:41,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default4]:[2022-03-03 22:55:41,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default7]:[2022-03-03 22:55:41,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default0]:[2022-03-03 22:55:41,206] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default6]:[2022-03-03 22:55:41,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default2]:[2022-03-03 22:55:41,198] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default1]:[2022-03-03 22:55:41,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default3]:[2022-03-03 22:55:41,359] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default3]:[2022-03-03 22:55:41,419] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default5]:[2022-03-03 22:55:41,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default4]:[2022-03-03 22:55:41,515] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default2]:[2022-03-03 22:55:41,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default3]:[2022-03-03 22:55:41,641] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default3]:[2022-03-03 22:55:41,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default2]:[2022-03-03 22:55:41,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default1]:[2022-03-03 22:55:41,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default5]:[2022-03-03 22:55:42,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default1]:[2022-03-03 22:55:42,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default1]:[2022-03-03 22:55:42,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default4]:[2022-03-03 22:55:42,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default0]:[2022-03-03 22:55:42,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default2]:[2022-03-03 22:55:42,328] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default7]:[2022-03-03 22:55:42,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default1]:[2022-03-03 22:55:42,458] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default6]:[2022-03-03 22:55:42,540] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default0]:[2022-03-03 22:55:42,565] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default5]:[2022-03-03 22:55:42,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default0]:[2022-03-03 22:55:42,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default7]:[2022-03-03 22:55:42,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default5]:[2022-03-03 22:55:42,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default4]:[2022-03-03 22:55:42,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default4]:[2022-03-03 22:55:42,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default6]:[2022-03-03 22:55:42,801] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default6]:[2022-03-03 22:55:42,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default1]:[2022-03-03 22:55:42,905] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default4]:[2022-03-03 22:55:42,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default1]:[2022-03-03 22:55:42,889] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default5]:[2022-03-03 22:55:42,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default7]:[2022-03-03 22:55:42,993] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default0]:[2022-03-03 22:55:42,951] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default1]:[2022-03-03 22:55:42,951] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default4]:[2022-03-03 22:55:42,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default3]:[2022-03-03 22:55:42,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default1]:[2022-03-03 22:55:42,885] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default7]:[2022-03-03 22:55:43,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default5]:[2022-03-03 22:55:43,122] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default6]:[2022-03-03 22:55:43,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default3]:[2022-03-03 22:55:43,265] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default1]:[2022-03-03 22:55:43,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default7]:[2022-03-03 22:55:43,303] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default2]:[2022-03-03 22:55:43,332] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default0]:[2022-03-03 22:55:43,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default0]:[2022-03-03 22:55:43,267] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default5]:[2022-03-03 22:55:43,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default0]:[2022-03-03 22:55:43,406] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default4]:[2022-03-03 22:55:43,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default0]:[2022-03-03 22:55:43,470] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default3]:[2022-03-03 22:55:43,459] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default2]:[2022-03-03 22:55:43,479] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default7]:[2022-03-03 22:55:43,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default6]:[2022-03-03 22:55:43,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default6]:[2022-03-03 22:55:43,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default2]:[2022-03-03 22:55:43,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default1]:[2022-03-03 22:55:43,933] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default7]:[2022-03-03 22:55:43,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default7]:[2022-03-03 22:55:43,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default2]:[2022-03-03 22:55:44,040] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default7]:[2022-03-03 22:55:44,088] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default3]:[2022-03-03 22:55:44,170] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default7]:[2022-03-03 22:55:44,112] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default6]:[2022-03-03 22:55:44,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default6]:[2022-03-03 22:55:44,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default2]:[2022-03-03 22:55:44,432] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default4]:[2022-03-03 22:55:44,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default3]:[2022-03-03 22:55:44,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default6]:[2022-03-03 22:55:44,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default1]:[2022-03-03 22:55:44,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default0]:[2022-03-03 22:55:44,553] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default3]:[2022-03-03 22:55:44,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default5]:[2022-03-03 22:55:44,635] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default2]:[2022-03-03 22:55:44,637] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default2]:[2022-03-03 22:55:44,670] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default3]:[2022-03-03 22:55:44,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default4]:[2022-03-03 22:55:44,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default5]:[2022-03-03 22:55:44,727] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default2]:[2022-03-03 22:55:44,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default0]:[2022-03-03 22:55:45,031] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default1]:[2022-03-03 22:55:45,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default7]:[2022-03-03 22:55:45,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default1]:[2022-03-03 22:55:45,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default2]:[2022-03-03 22:55:45,161] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default2]:[2022-03-03 22:55:45,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default6]:[2022-03-03 22:55:45,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default5]:[2022-03-03 22:55:45,287] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default2]:[2022-03-03 22:55:45,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default3]:[2022-03-03 22:55:45,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default6]:[2022-03-03 22:55:45,437] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default1]:[2022-03-03 22:55:45,471] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default4]:[2022-03-03 22:55:45,529] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default5]:[2022-03-03 22:55:45,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default7]:[2022-03-03 22:55:45,480] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default3]:[2022-03-03 22:55:45,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default3]:[2022-03-03 22:55:45,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default7]:[2022-03-03 22:55:45,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default6]:[2022-03-03 22:55:45,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default7]:[2022-03-03 22:55:45,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default4]:[2022-03-03 22:55:45,813] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default6]:[2022-03-03 22:55:45,817] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default3]:[2022-03-03 22:55:45,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default3]:[2022-03-03 22:55:45,857] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default4]:[2022-03-03 22:55:45,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default5]:[2022-03-03 22:55:45,974] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default2]:[2022-03-03 22:55:46,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default7]:[2022-03-03 22:55:46,206] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default4]:[2022-03-03 22:55:46,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default7]:[2022-03-03 22:55:46,215] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default1]:[2022-03-03 22:55:46,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default6]:[2022-03-03 22:55:46,268] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default0]:[2022-03-03 22:55:46,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default0]:[2022-03-03 22:55:46,296] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default5]:[2022-03-03 22:55:46,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default2]:[2022-03-03 22:55:46,373] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default3]:[2022-03-03 22:55:46,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default4]:[2022-03-03 22:55:46,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default5]:[2022-03-03 22:55:46,502] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default6]:[2022-03-03 22:55:46,443] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default3]:[2022-03-03 22:55:46,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default0]:[2022-03-03 22:55:46,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default6]:[2022-03-03 22:55:46,541] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default1]:[2022-03-03 22:55:46,546] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default4]:[2022-03-03 22:55:46,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default3]:[2022-03-03 22:55:46,569] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default1]:[2022-03-03 22:55:46,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default4]:[2022-03-03 22:55:46,694] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default5]:[2022-03-03 22:55:46,707] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default6]:[2022-03-03 22:55:46,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default2]:[2022-03-03 22:55:46,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default2]:[2022-03-03 22:55:46,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default5]:[2022-03-03 22:55:46,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default1]:[2022-03-03 22:55:46,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default4]:[2022-03-03 22:55:46,771] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default1]:[2022-03-03 22:55:46,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default0]:[2022-03-03 22:55:46,814] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default7]:[2022-03-03 22:55:46,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default3]:[2022-03-03 22:55:46,819] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default3]:[2022-03-03 22:55:46,774] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default5]:[2022-03-03 22:55:46,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default3]:[2022-03-03 22:55:46,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default7]:[2022-03-03 22:55:46,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default3]:[2022-03-03 22:55:46,952] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default7]:[2022-03-03 22:55:46,874] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default2]:[2022-03-03 22:55:46,872] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default0]:[2022-03-03 22:55:46,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default4]:[2022-03-03 22:55:46,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default0]:[2022-03-03 22:55:46,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default5]:[2022-03-03 22:55:46,995] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default6]:[2022-03-03 22:55:46,999] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default3]:[2022-03-03 22:55:47,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default0]:[2022-03-03 22:55:47,082] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default2]:[2022-03-03 22:55:47,085] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default1]:[2022-03-03 22:55:47,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default2]:[2022-03-03 22:55:47,085] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default0]:[2022-03-03 22:55:47,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default2]:[2022-03-03 22:55:47,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default1]:[2022-03-03 22:55:47,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default0]:[2022-03-03 22:55:47,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default2]:[2022-03-03 22:55:47,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default4]:[2022-03-03 22:55:47,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default2]:[2022-03-03 22:55:47,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default5]:[2022-03-03 22:55:47,251] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default5]:[2022-03-03 22:55:47,307] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default4]:[2022-03-03 22:55:47,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default3]:[2022-03-03 22:55:47,318] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default7]:[2022-03-03 22:55:47,308] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default7]:[2022-03-03 22:55:47,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default0]:[2022-03-03 22:55:47,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default0]:[2022-03-03 22:55:47,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default2]:[2022-03-03 22:55:47,434] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default2]:[2022-03-03 22:55:47,445] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default4]:[2022-03-03 22:55:47,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default1]:[2022-03-03 22:55:47,475] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default4]:[2022-03-03 22:55:47,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default5]:[2022-03-03 22:55:47,566] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default5]:[2022-03-03 22:55:47,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default3]:[2022-03-03 22:55:47,521] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default1]:[2022-03-03 22:55:47,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default7]:[2022-03-03 22:55:47,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default2]:[2022-03-03 22:55:47,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default0]:[2022-03-03 22:55:47,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default4]:[2022-03-03 22:55:47,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default5]:[2022-03-03 22:55:47,681] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default7]:[2022-03-03 22:55:47,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default5]:[2022-03-03 22:55:47,761] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default1]:[2022-03-03 22:55:47,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default1]:[2022-03-03 22:55:47,755] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default6]:[2022-03-03 22:55:47,791] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default2]:[2022-03-03 22:55:47,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default0]:[2022-03-03 22:55:47,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default1]:[2022-03-03 22:55:47,825] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default6]:[2022-03-03 22:55:47,762] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default3]:[2022-03-03 22:55:47,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default7]:[2022-03-03 22:55:47,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default7]:[2022-03-03 22:55:47,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default6]:[2022-03-03 22:55:47,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default0]:[2022-03-03 22:55:47,871] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default6]:[2022-03-03 22:55:47,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default4]:[2022-03-03 22:55:47,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default5]:[2022-03-03 22:55:47,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default3]:[2022-03-03 22:55:48,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default1]:[2022-03-03 22:55:47,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default1]:[2022-03-03 22:55:48,036] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default2]:[2022-03-03 22:55:48,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default7]:[2022-03-03 22:55:48,025] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default6]:[2022-03-03 22:55:48,055] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default0]:[2022-03-03 22:55:48,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default2]:[2022-03-03 22:55:48,076] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default0]:[2022-03-03 22:55:48,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default0]:[2022-03-03 22:55:48,098] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default1]:[2022-03-03 22:55:48,136] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default5]:[2022-03-03 22:55:48,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default0]:[2022-03-03 22:55:48,259] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default6]:[2022-03-03 22:55:48,257] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default5]:[2022-03-03 22:55:48,288] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default3]:[2022-03-03 22:55:48,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default1]:[2022-03-03 22:55:48,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default2]:[2022-03-03 22:55:48,480] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default6]:[2022-03-03 22:55:48,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default4]:[2022-03-03 22:55:48,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default4]:[2022-03-03 22:55:48,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default7]:[2022-03-03 22:55:48,739] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default3]:[2022-03-03 22:55:48,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default6]:[2022-03-03 22:55:48,709] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default2]:[2022-03-03 22:55:48,709] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default2]:[2022-03-03 22:55:48,828] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default0]:[2022-03-03 22:55:48,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default5]:[2022-03-03 22:55:48,849] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default1]:[2022-03-03 22:55:48,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default0]:[2022-03-03 22:55:48,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default6]:[2022-03-03 22:55:49,052] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default4]:[2022-03-03 22:55:49,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default1]:[2022-03-03 22:55:48,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default6]:[2022-03-03 22:55:48,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default4]:[2022-03-03 22:55:49,026] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default7]:[2022-03-03 22:55:49,130] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default3]:[2022-03-03 22:55:49,173] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default1]:[2022-03-03 22:55:49,106] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default2]:[2022-03-03 22:55:49,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default0]:[2022-03-03 22:55:49,214] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default4]:[2022-03-03 22:55:49,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default0]:[2022-03-03 22:55:49,340] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default2]:[2022-03-03 22:55:49,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt
[default5]:[2022-03-03 22:55:49,344] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt
[default2]:[2022-03-03 22:55:49,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt
[default6]:[2022-03-03 22:55:49,336] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt
[default4]:[2022-03-03 22:55:49,364] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt
[default3]:[2022-03-03 22:55:49,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt
[default1]:[2022-03-03 22:55:49,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default0]:[2022-03-03 22:55:49,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 22:55:49,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default5]:[2022-03-03 22:55:49,440] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt
[default7]:[2022-03-03 22:55:49,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt
[default3]:[2022-03-03 22:55:49,585] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt
[default1]:[2022-03-03 22:55:49,622] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default6]:[2022-03-03 22:55:49,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt
[default6]:[2022-03-03 22:55:49,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:49,602] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt
[default0]:[2022-03-03 22:55:49,651] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[default7]:[2022-03-03 22:55:49,654] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 22:55:49,657] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt
[default3]:[2022-03-03 22:55:49,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt
[default4]:[2022-03-03 22:55:49,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt
[default6]:[2022-03-03 22:55:50,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default5]:[2022-03-03 22:55:50,044] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt
[default2]:[2022-03-03 22:55:50,089] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:50,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt
[default7]:[2022-03-03 22:55:50,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default0]:[2022-03-03 22:55:50,162] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 22:55:50,206] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 22:55:50,334] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default7]:[2022-03-03 22:55:50,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt
[default6]:[2022-03-03 22:55:50,374] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt
[default1]:[2022-03-03 22:55:50,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 22:55:50,484] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default4]:[2022-03-03 22:55:50,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
[default6]:[2022-03-03 22:55:50,528] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default4]:[2022-03-03 22:55:50,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default1]:[2022-03-03 22:55:50,590] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default4]:[2022-03-03 22:55:50,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt
[default5]:[2022-03-03 22:55:50,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default0]:[2022-03-03 22:55:50,639] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 22:55:50,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt
[default4]:[2022-03-03 22:55:50,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default5]:[2022-03-03 22:55:50,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt
[default5]:[2022-03-03 22:55:50,888] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt
[default0]:[2022-03-03 22:55:50,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt
[default5]:[2022-03-03 22:55:50,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt
[default7]:[2022-03-03 22:55:51,057] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt
[default2]:[2022-03-03 22:55:51,145] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default4]:[2022-03-03 22:55:51,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt
[default7]:[2022-03-03 22:55:51,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt
[default5]:[2022-03-03 22:55:51,263] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 22:55:51,294] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default3]:[2022-03-03 22:55:51,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt
[default2]:[2022-03-03 22:55:51,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default3]:[2022-03-03 22:55:51,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default3]:[2022-03-03 22:55:51,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default1]:[2022-03-03 22:55:51,538] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt
[default2]:[2022-03-03 22:55:51,685] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default3]:[2022-03-03 22:55:51,862] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt
[default5]:[2022-03-03 22:55:51,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt
[default5]:[2022-03-03 22:55:51,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default5]:[2022-03-03 22:55:52,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default4]:[2022-03-03 22:55:52,010] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default7]:[2022-03-03 22:55:52,054] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default4]:[2022-03-03 22:55:52,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt
[default1]:[2022-03-03 22:55:52,120] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt
[default4]:[2022-03-03 22:55:52,129] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
[default0]:[2022-03-03 22:55:52,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt
[default1]:[2022-03-03 22:55:52,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default5]:[2022-03-03 22:55:52,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt
[default7]:[2022-03-03 22:55:52,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 22:55:52,316] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt
[default3]:[2022-03-03 22:55:52,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default3]:[2022-03-03 22:55:52,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt
[default1]:[2022-03-03 22:55:52,404] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt
[default2]:[2022-03-03 22:55:52,549] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 22:55:52,581] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt
[default3]:[2022-03-03 22:55:52,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt
[default2]:[2022-03-03 22:55:52,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default0]:[2022-03-03 22:55:52,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt
[default6]:[2022-03-03 22:55:52,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default6]:[2022-03-03 22:55:52,594] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt
[default7]:[2022-03-03 22:55:52,614] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt
[default0]:[2022-03-03 22:55:52,751] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
[default3]:[2022-03-03 22:55:52,736] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt
[default0]:[2022-03-03 22:55:52,916] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default7]:[2022-03-03 22:55:53,243] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt
[default4]:[2022-03-03 22:55:53,391] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default5]:[2022-03-03 22:55:53,386] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default6]:[2022-03-03 22:55:53,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default6]:[2022-03-03 22:55:53,932] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default5]:[2022-03-03 22:55:53,920] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default4]:[2022-03-03 22:55:53,970] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default7]:[2022-03-03 22:55:54,023] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default5]:[2022-03-03 22:55:54,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default4]:[2022-03-03 22:55:54,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default7]:[2022-03-03 22:55:54,648] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt
[default6]:[2022-03-03 22:55:54,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default6]:[2022-03-03 22:55:54,759] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt
[default7]:[2022-03-03 22:55:54,813] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt
[default6]:[2022-03-03 22:55:54,870] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt
[default5]:[2022-03-03 22:55:54,972] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default4]:[2022-03-03 22:55:55,042] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default7]:[2022-03-03 22:55:55,019] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default1]:[2022-03-03 22:55:55,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default0]:[2022-03-03 22:55:55,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default2]:[2022-03-03 22:55:55,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default3]:[2022-03-03 22:55:55,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default0]:[2022-03-03 22:55:56,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default7]:time (ms) | save-checkpoint: 39581.71
[default0]: successfully saved checkpoint at iteration 4000 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default1]:[2022-03-03 22:55:56,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4000/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default7]: iteration 4001/ 128728 | consumed samples: 64016 | consumed tokens: 131104768 | elapsed time per iteration (s): 74.30 | learning rate: 2.098E-05 | global batch size: 16 | lm loss: 5.543649E+00 | grad norm: 1.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.215 | TFLOPs: 1.65 |
[default7]: iteration 4002/ 128728 | consumed samples: 64032 | consumed tokens: 131137536 | elapsed time per iteration (s): 15.26 | learning rate: 2.098E-05 | global batch size: 16 | lm loss: 5.309994E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration 4003/ 128728 | consumed samples: 64048 | consumed tokens: 131170304 | elapsed time per iteration (s): 15.25 | learning rate: 2.099E-05 | global batch size: 16 | lm loss: 5.382421E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration 4004/ 128728 | consumed samples: 64064 | consumed tokens: 131203072 | elapsed time per iteration (s): 15.21 | learning rate: 2.099E-05 | global batch size: 16 | lm loss: 5.472140E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration 4005/ 128728 | consumed samples: 64080 | consumed tokens: 131235840 | elapsed time per iteration (s): 15.21 | learning rate: 2.100E-05 | global batch size: 16 | lm loss: 5.577133E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4006/ 128728 | consumed samples: 64096 | consumed tokens: 131268608 | elapsed time per iteration (s): 15.20 | learning rate: 2.100E-05 | global batch size: 16 | lm loss: 5.207786E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration 4007/ 128728 | consumed samples: 64112 | consumed tokens: 131301376 | elapsed time per iteration (s): 15.17 | learning rate: 2.101E-05 | global batch size: 16 | lm loss: 5.436874E+00 | grad norm: 0.891 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration 4008/ 128728 | consumed samples: 64128 | consumed tokens: 131334144 | elapsed time per iteration (s): 15.21 | learning rate: 2.101E-05 | global batch size: 16 | lm loss: 5.084764E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4009/ 128728 | consumed samples: 64144 | consumed tokens: 131366912 | elapsed time per iteration (s): 15.19 | learning rate: 2.102E-05 | global batch size: 16 | lm loss: 5.114124E+00 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4010/ 128728 | consumed samples: 64160 | consumed tokens: 131399680 | elapsed time per iteration (s): 15.23 | learning rate: 2.102E-05 | global batch size: 16 | lm loss: 5.451921E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4011/ 128728 | consumed samples: 64176 | consumed tokens: 131432448 | elapsed time per iteration (s): 15.21 | learning rate: 2.103E-05 | global batch size: 16 | lm loss: 5.215581E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4012/ 128728 | consumed samples: 64192 | consumed tokens: 131465216 | elapsed time per iteration (s): 15.18 | learning rate: 2.103E-05 | global batch size: 16 | lm loss: 5.158479E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration 4013/ 128728 | consumed samples: 64208 | consumed tokens: 131497984 | elapsed time per iteration (s): 15.18 | learning rate: 2.104E-05 | global batch size: 16 | lm loss: 5.238644E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration 4014/ 128728 | consumed samples: 64224 | consumed tokens: 131530752 | elapsed time per iteration (s): 15.29 | learning rate: 2.104E-05 | global batch size: 16 | lm loss: 5.194250E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.01 |
[default7]: iteration 4015/ 128728 | consumed samples: 64240 | consumed tokens: 131563520 | elapsed time per iteration (s): 15.23 | learning rate: 2.105E-05 | global batch size: 16 | lm loss: 5.281526E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4016/ 128728 | consumed samples: 64256 | consumed tokens: 131596288 | elapsed time per iteration (s): 15.20 | learning rate: 2.106E-05 | global batch size: 16 | lm loss: 5.243568E+00 | grad norm: 1.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4017/ 128728 | consumed samples: 64272 | consumed tokens: 131629056 | elapsed time per iteration (s): 15.20 | learning rate: 2.106E-05 | global batch size: 16 | lm loss: 5.439724E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration 4018/ 128728 | consumed samples: 64288 | consumed tokens: 131661824 | elapsed time per iteration (s): 15.23 | learning rate: 2.107E-05 | global batch size: 16 | lm loss: 5.292508E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4019/ 128728 | consumed samples: 64304 | consumed tokens: 131694592 | elapsed time per iteration (s): 15.23 | learning rate: 2.107E-05 | global batch size: 16 | lm loss: 5.304052E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4020/ 128728 | consumed samples: 64320 | consumed tokens: 131727360 | elapsed time per iteration (s): 15.24 | learning rate: 2.108E-05 | global batch size: 16 | lm loss: 5.270075E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4021/ 128728 | consumed samples: 64336 | consumed tokens: 131760128 | elapsed time per iteration (s): 15.21 | learning rate: 2.108E-05 | global batch size: 16 | lm loss: 5.335760E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration 4022/ 128728 | consumed samples: 64352 | consumed tokens: 131792896 | elapsed time per iteration (s): 15.20 | learning rate: 2.109E-05 | global batch size: 16 | lm loss: 5.347544E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4023/ 128728 | consumed samples: 64368 | consumed tokens: 131825664 | elapsed time per iteration (s): 15.20 | learning rate: 2.109E-05 | global batch size: 16 | lm loss: 5.364645E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4024/ 128728 | consumed samples: 64384 | consumed tokens: 131858432 | elapsed time per iteration (s): 15.21 | learning rate: 2.110E-05 | global batch size: 16 | lm loss: 5.381454E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4025/ 128728 | consumed samples: 64400 | consumed tokens: 131891200 | elapsed time per iteration (s): 15.20 | learning rate: 2.110E-05 | global batch size: 16 | lm loss: 5.369591E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4026/ 128728 | consumed samples: 64416 | consumed tokens: 131923968 | elapsed time per iteration (s): 15.24 | learning rate: 2.111E-05 | global batch size: 16 | lm loss: 5.085846E+00 | grad norm: 0.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4027/ 128728 | consumed samples: 64432 | consumed tokens: 131956736 | elapsed time per iteration (s): 15.26 | learning rate: 2.111E-05 | global batch size: 16 | lm loss: 5.172122E+00 | grad norm: 1.055 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration 4028/ 128728 | consumed samples: 64448 | consumed tokens: 131989504 | elapsed time per iteration (s): 15.17 | learning rate: 2.112E-05 | global batch size: 16 | lm loss: 5.216266E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration 4029/ 128728 | consumed samples: 64464 | consumed tokens: 132022272 | elapsed time per iteration (s): 15.17 | learning rate: 2.112E-05 | global batch size: 16 | lm loss: 5.658593E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration 4030/ 128728 | consumed samples: 64480 | consumed tokens: 132055040 | elapsed time per iteration (s): 15.26 | learning rate: 2.113E-05 | global batch size: 16 | lm loss: 5.392428E+00 | grad norm: 1.095 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration 4031/ 128728 | consumed samples: 64496 | consumed tokens: 132087808 | elapsed time per iteration (s): 15.24 | learning rate: 2.113E-05 | global batch size: 16 | lm loss: 5.504789E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4032/ 128728 | consumed samples: 64512 | consumed tokens: 132120576 | elapsed time per iteration (s): 15.23 | learning rate: 2.114E-05 | global batch size: 16 | lm loss: 5.514112E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4033/ 128728 | consumed samples: 64528 | consumed tokens: 132153344 | elapsed time per iteration (s): 15.23 | learning rate: 2.114E-05 | global batch size: 16 | lm loss: 5.394161E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 |
[default7]: iteration 4034/ 128728 | consumed samples: 64544 | consumed tokens: 132186112 | elapsed time per iteration (s): 15.20 | learning rate: 2.115E-05 | global batch size: 16 | lm loss: 5.352733E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4035/ 128728 | consumed samples: 64560 | consumed tokens: 132218880 | elapsed time per iteration (s): 15.17 | learning rate: 2.116E-05 | global batch size: 16 | lm loss: 5.341866E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 |
[default7]: iteration 4036/ 128728 | consumed samples: 64576 | consumed tokens: 132251648 | elapsed time per iteration (s): 15.21 | learning rate: 2.116E-05 | global batch size: 16 | lm loss: 5.249400E+00 | grad norm: 0.860 |
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4037/ 128728 | consumed samples: 64592 | consumed tokens: 132284416 | elapsed time per iteration (s): 15.23 | learning rate: 2.117E-05 | global batch size: 16 | lm loss: 5.349155E+00 | grad norm: 0.646 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4038/ 128728 | consumed samples: 64608 | consumed tokens: 132317184 | elapsed time per iteration (s): 15.24 | learning rate: 2.117E-05 | global batch size: 16 | lm loss: 5.406515E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4039/ 128728 | consumed samples: 64624 | consumed tokens: 132349952 | elapsed time per iteration (s): 15.21 | learning rate: 2.118E-05 | global batch size: 16 | lm loss: 5.378917E+00 | grad norm: 0.937 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4040/ 128728 | consumed samples: 64640 | consumed tokens: 132382720 | elapsed time per iteration (s): 15.20 | learning rate: 2.118E-05 | global batch size: 16 | lm loss: 5.407258E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4041/ 128728 | consumed samples: 64656 | consumed tokens: 132415488 | elapsed time per iteration (s): 15.23 | learning rate: 2.119E-05 | global batch size: 16 | lm loss: 5.460241E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4042/ 128728 | consumed samples: 64672 | consumed tokens: 132448256 | elapsed time per iteration (s): 
15.17 | learning rate: 2.119E-05 | global batch size: 16 | lm loss: 5.356334E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4043/ 128728 | consumed samples: 64688 | consumed tokens: 132481024 | elapsed time per iteration (s): 15.23 | learning rate: 2.120E-05 | global batch size: 16 | lm loss: 5.390794E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4044/ 128728 | consumed samples: 64704 | consumed tokens: 132513792 | elapsed time per iteration (s): 15.24 | learning rate: 2.120E-05 | global batch size: 16 | lm loss: 5.281722E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4045/ 128728 | consumed samples: 64720 | consumed tokens: 132546560 | elapsed time per iteration (s): 15.19 | learning rate: 2.121E-05 | global batch size: 16 | lm loss: 5.315298E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4046/ 128728 | consumed samples: 64736 | consumed tokens: 132579328 | elapsed time per iteration (s): 15.22 | learning rate: 2.121E-05 | global batch size: 16 | lm loss: 5.433873E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4047/ 128728 | consumed samples: 64752 | consumed tokens: 132612096 | elapsed time per iteration (s): 15.24 | learning rate: 2.122E-05 | global batch size: 16 | lm loss: 5.412467E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 
4048/ 128728 | consumed samples: 64768 | consumed tokens: 132644864 | elapsed time per iteration (s): 15.23 | learning rate: 2.122E-05 | global batch size: 16 | lm loss: 5.430539E+00 | grad norm: 0.621 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4049/ 128728 | consumed samples: 64784 | consumed tokens: 132677632 | elapsed time per iteration (s): 15.16 | learning rate: 2.123E-05 | global batch size: 16 | lm loss: 5.519146E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4050/ 128728 | consumed samples: 64800 | consumed tokens: 132710400 | elapsed time per iteration (s): 15.21 | learning rate: 2.123E-05 | global batch size: 16 | lm loss: 5.257305E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4051/ 128728 | consumed samples: 64816 | consumed tokens: 132743168 | elapsed time per iteration (s): 15.23 | learning rate: 2.124E-05 | global batch size: 16 | lm loss: 5.442199E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4052/ 128728 | consumed samples: 64832 | consumed tokens: 132775936 | elapsed time per iteration (s): 15.24 | learning rate: 2.124E-05 | global batch size: 16 | lm loss: 5.309348E+00 | grad norm: 0.974 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4053/ 128728 | consumed samples: 64848 | consumed tokens: 132808704 | elapsed time per iteration (s): 15.23 | learning rate: 2.125E-05 | global batch size: 16 | lm loss: 5.319548E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4054/ 128728 | consumed samples: 64864 | consumed tokens: 132841472 | elapsed time per iteration (s): 15.25 | learning rate: 2.125E-05 | global batch size: 16 | lm loss: 5.239305E+00 | grad norm: 1.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4055/ 128728 | consumed samples: 64880 | consumed tokens: 132874240 | elapsed time per iteration (s): 15.20 | learning rate: 2.126E-05 | global batch size: 16 | lm loss: 5.223973E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4056/ 128728 | consumed samples: 64896 | consumed tokens: 132907008 | elapsed time per iteration (s): 15.21 | learning rate: 2.127E-05 | global batch size: 16 | lm loss: 5.312015E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4057/ 128728 | consumed samples: 64912 | consumed tokens: 132939776 | elapsed time per iteration (s): 15.22 | learning rate: 2.127E-05 | global batch size: 16 | lm loss: 5.303139E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4058/ 128728 | consumed samples: 64928 | consumed tokens: 132972544 | elapsed time per iteration (s): 15.20 | learning rate: 2.128E-05 | global batch size: 16 | lm loss: 5.248675E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4059/ 128728 | consumed samples: 64944 | consumed tokens: 133005312 | elapsed time per iteration (s): 15.23 | learning rate: 2.128E-05 | 
global batch size: 16 | lm loss: 5.210124E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4060/ 128728 | consumed samples: 64960 | consumed tokens: 133038080 | elapsed time per iteration (s): 15.21 | learning rate: 2.129E-05 | global batch size: 16 | lm loss: 5.407516E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4061/ 128728 | consumed samples: 64976 | consumed tokens: 133070848 | elapsed time per iteration (s): 15.24 | learning rate: 2.129E-05 | global batch size: 16 | lm loss: 5.311096E+00 | grad norm: 0.669 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4062/ 128728 | consumed samples: 64992 | consumed tokens: 133103616 | elapsed time per iteration (s): 15.21 | learning rate: 2.130E-05 | global batch size: 16 | lm loss: 5.263693E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4063/ 128728 | consumed samples: 65008 | consumed tokens: 133136384 | elapsed time per iteration (s): 15.24 | learning rate: 2.130E-05 | global batch size: 16 | lm loss: 5.587279E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4064/ 128728 | consumed samples: 65024 | consumed tokens: 133169152 | elapsed time per iteration (s): 15.22 | learning rate: 2.131E-05 | global batch size: 16 | lm loss: 5.389854E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4065/ 128728 | consumed samples: 
65040 | consumed tokens: 133201920 | elapsed time per iteration (s): 15.22 | learning rate: 2.131E-05 | global batch size: 16 | lm loss: 5.493057E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4066/ 128728 | consumed samples: 65056 | consumed tokens: 133234688 | elapsed time per iteration (s): 15.21 | learning rate: 2.132E-05 | global batch size: 16 | lm loss: 5.206816E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4067/ 128728 | consumed samples: 65072 | consumed tokens: 133267456 | elapsed time per iteration (s): 15.23 | learning rate: 2.132E-05 | global batch size: 16 | lm loss: 5.457879E+00 | grad norm: 0.985 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4068/ 128728 | consumed samples: 65088 | consumed tokens: 133300224 | elapsed time per iteration (s): 15.16 | learning rate: 2.133E-05 | global batch size: 16 | lm loss: 5.155887E+00 | grad norm: 0.949 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4069/ 128728 | consumed samples: 65104 | consumed tokens: 133332992 | elapsed time per iteration (s): 15.22 | learning rate: 2.133E-05 | global batch size: 16 | lm loss: 5.326896E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4070/ 128728 | consumed samples: 65120 | consumed tokens: 133365760 | elapsed time per iteration (s): 15.21 | learning rate: 2.134E-05 | global batch size: 16 | lm loss: 5.390995E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4071/ 128728 | consumed samples: 65136 | consumed tokens: 133398528 | elapsed time per iteration (s): 15.21 | learning rate: 2.134E-05 | global batch size: 16 | lm loss: 5.471291E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4072/ 128728 | consumed samples: 65152 | consumed tokens: 133431296 | elapsed time per iteration (s): 15.17 | learning rate: 2.135E-05 | global batch size: 16 | lm loss: 5.400915E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 4073/ 128728 | consumed samples: 65168 | consumed tokens: 133464064 | elapsed time per iteration (s): 15.21 | learning rate: 2.135E-05 | global batch size: 16 | lm loss: 5.486737E+00 | grad norm: 0.804 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4074/ 128728 | consumed samples: 65184 | consumed tokens: 133496832 | elapsed time per iteration (s): 15.21 | learning rate: 2.136E-05 | global batch size: 16 | lm loss: 5.547410E+00 | grad norm: 0.906 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4075/ 128728 | consumed samples: 65200 | consumed tokens: 133529600 | elapsed time per iteration (s): 15.24 | learning rate: 2.136E-05 | global batch size: 16 | lm loss: 5.069981E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4076/ 128728 | consumed samples: 65216 | consumed tokens: 133562368 | elapsed time per iteration (s): 15.18 | learning rate: 2.137E-05 | global batch size: 16 | lm loss: 5.209479E+00 
| grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4077/ 128728 | consumed samples: 65232 | consumed tokens: 133595136 | elapsed time per iteration (s): 15.21 | learning rate: 2.138E-05 | global batch size: 16 | lm loss: 5.274742E+00 | grad norm: 1.531 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4078/ 128728 | consumed samples: 65248 | consumed tokens: 133627904 | elapsed time per iteration (s): 15.22 | learning rate: 2.138E-05 | global batch size: 16 | lm loss: 5.524727E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4079/ 128728 | consumed samples: 65264 | consumed tokens: 133660672 | elapsed time per iteration (s): 15.17 | learning rate: 2.139E-05 | global batch size: 16 | lm loss: 5.480323E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4080/ 128728 | consumed samples: 65280 | consumed tokens: 133693440 | elapsed time per iteration (s): 15.21 | learning rate: 2.139E-05 | global batch size: 16 | lm loss: 5.410918E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4081/ 128728 | consumed samples: 65296 | consumed tokens: 133726208 | elapsed time per iteration (s): 15.18 | learning rate: 2.140E-05 | global batch size: 16 | lm loss: 5.439363E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4082/ 128728 | consumed samples: 65312 | consumed tokens: 133758976 | elapsed time 
per iteration (s): 15.22 | learning rate: 2.140E-05 | global batch size: 16 | lm loss: 5.484829E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4083/ 128728 | consumed samples: 65328 | consumed tokens: 133791744 | elapsed time per iteration (s): 15.22 | learning rate: 2.141E-05 | global batch size: 16 | lm loss: 5.084867E+00 | grad norm: 1.353 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4084/ 128728 | consumed samples: 65344 | consumed tokens: 133824512 | elapsed time per iteration (s): 15.23 | learning rate: 2.141E-05 | global batch size: 16 | lm loss: 5.351573E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4085/ 128728 | consumed samples: 65360 | consumed tokens: 133857280 | elapsed time per iteration (s): 15.20 | learning rate: 2.142E-05 | global batch size: 16 | lm loss: 5.428648E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4086/ 128728 | consumed samples: 65376 | consumed tokens: 133890048 | elapsed time per iteration (s): 15.23 | learning rate: 2.142E-05 | global batch size: 16 | lm loss: 5.263672E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4087/ 128728 | consumed samples: 65392 | consumed tokens: 133922816 | elapsed time per iteration (s): 15.24 | learning rate: 2.143E-05 | global batch size: 16 | lm loss: 5.312733E+00 | grad norm: 5.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 4088/ 128728 | consumed samples: 65408 | consumed tokens: 133955584 | elapsed time per iteration (s): 15.15 | learning rate: 2.143E-05 | global batch size: 16 | lm loss: 5.681293E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4089/ 128728 | consumed samples: 65424 | consumed tokens: 133988352 | elapsed time per iteration (s): 15.23 | learning rate: 2.144E-05 | global batch size: 16 | lm loss: 5.367632E+00 | grad norm: 0.692 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4090/ 128728 | consumed samples: 65440 | consumed tokens: 134021120 | elapsed time per iteration (s): 15.22 | learning rate: 2.144E-05 | global batch size: 16 | lm loss: 5.301403E+00 | grad norm: 2.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4091/ 128728 | consumed samples: 65456 | consumed tokens: 134053888 | elapsed time per iteration (s): 15.20 | learning rate: 2.145E-05 | global batch size: 16 | lm loss: 5.406228E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4092/ 128728 | consumed samples: 65472 | consumed tokens: 134086656 | elapsed time per iteration (s): 15.22 | learning rate: 2.145E-05 | global batch size: 16 | lm loss: 5.559772E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4093/ 128728 | consumed samples: 65488 | consumed tokens: 134119424 | elapsed time per iteration (s): 15.25 | learning rate: 2.146E-05 | global batch size: 16 | lm loss: 5.276206E+00 | grad norm: 2.740 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4094/ 128728 | consumed samples: 65504 | consumed tokens: 134152192 | elapsed time per iteration (s): 15.25 | learning rate: 2.146E-05 | global batch size: 16 | lm loss: 5.252672E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4095/ 128728 | consumed samples: 65520 | consumed tokens: 134184960 | elapsed time per iteration (s): 15.22 | learning rate: 2.147E-05 | global batch size: 16 | lm loss: 5.704553E+00 | grad norm: 0.956 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4096/ 128728 | consumed samples: 65536 | consumed tokens: 134217728 | elapsed time per iteration (s): 15.24 | learning rate: 2.147E-05 | global batch size: 16 | lm loss: 5.402080E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4097/ 128728 | consumed samples: 65552 | consumed tokens: 134250496 | elapsed time per iteration (s): 15.25 | learning rate: 2.148E-05 | global batch size: 16 | lm loss: 5.459865E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4098/ 128728 | consumed samples: 65568 | consumed tokens: 134283264 | elapsed time per iteration (s): 15.22 | learning rate: 2.149E-05 | global batch size: 16 | lm loss: 5.188199E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4099/ 128728 | consumed samples: 65584 | consumed tokens: 134316032 | elapsed time per iteration (s): 15.17 | learning rate: 
2.149E-05 | global batch size: 16 | lm loss: 5.217367E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4100/ 128728 | consumed samples: 65600 | consumed tokens: 134348800 | elapsed time per iteration (s): 15.25 | learning rate: 2.150E-05 | global batch size: 16 | lm loss: 5.555455E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4101/ 128728 | consumed samples: 65616 | consumed tokens: 134381568 | elapsed time per iteration (s): 15.26 | learning rate: 2.150E-05 | global batch size: 16 | lm loss: 5.449624E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4102/ 128728 | consumed samples: 65632 | consumed tokens: 134414336 | elapsed time per iteration (s): 15.24 | learning rate: 2.151E-05 | global batch size: 16 | lm loss: 5.290594E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4103/ 128728 | consumed samples: 65648 | consumed tokens: 134447104 | elapsed time per iteration (s): 15.19 | learning rate: 2.151E-05 | global batch size: 16 | lm loss: 5.386337E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4104/ 128728 | consumed samples: 65664 | consumed tokens: 134479872 | elapsed time per iteration (s): 15.21 | learning rate: 2.152E-05 | global batch size: 16 | lm loss: 5.386989E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4105/ 128728 | consumed 
samples: 65680 | consumed tokens: 134512640 | elapsed time per iteration (s): 15.20 | learning rate: 2.152E-05 | global batch size: 16 | lm loss: 5.478673E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4106/ 128728 | consumed samples: 65696 | consumed tokens: 134545408 | elapsed time per iteration (s): 15.22 | learning rate: 2.153E-05 | global batch size: 16 | lm loss: 5.298902E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4107/ 128728 | consumed samples: 65712 | consumed tokens: 134578176 | elapsed time per iteration (s): 15.24 | learning rate: 2.153E-05 | global batch size: 16 | lm loss: 5.438011E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4108/ 128728 | consumed samples: 65728 | consumed tokens: 134610944 | elapsed time per iteration (s): 15.20 | learning rate: 2.154E-05 | global batch size: 16 | lm loss: 5.436982E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4109/ 128728 | consumed samples: 65744 | consumed tokens: 134643712 | elapsed time per iteration (s): 15.16 | learning rate: 2.154E-05 | global batch size: 16 | lm loss: 5.352978E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4110/ 128728 | consumed samples: 65760 | consumed tokens: 134676480 | elapsed time per iteration (s): 15.16 | learning rate: 2.155E-05 | global batch size: 16 | lm loss: 5.435023E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4111/ 128728 | consumed samples: 65776 | consumed tokens: 134709248 | elapsed time per iteration (s): 15.25 | learning rate: 2.155E-05 | global batch size: 16 | lm loss: 5.427981E+00 | grad norm: 3.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4112/ 128728 | consumed samples: 65792 | consumed tokens: 134742016 | elapsed time per iteration (s): 15.19 | learning rate: 2.156E-05 | global batch size: 16 | lm loss: 5.224471E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4113/ 128728 | consumed samples: 65808 | consumed tokens: 134774784 | elapsed time per iteration (s): 15.21 | learning rate: 2.156E-05 | global batch size: 16 | lm loss: 5.465331E+00 | grad norm: 0.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4114/ 128728 | consumed samples: 65824 | consumed tokens: 134807552 | elapsed time per iteration (s): 15.20 | learning rate: 2.157E-05 | global batch size: 16 | lm loss: 5.457709E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4115/ 128728 | consumed samples: 65840 | consumed tokens: 134840320 | elapsed time per iteration (s): 15.20 | learning rate: 2.157E-05 | global batch size: 16 | lm loss: 5.527482E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4116/ 128728 | consumed samples: 65856 | consumed tokens: 134873088 | elapsed time per iteration (s): 15.20 | learning rate: 2.158E-05 | global batch size: 16 | lm 
loss: 5.455926E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4117/ 128728 | consumed samples: 65872 | consumed tokens: 134905856 | elapsed time per iteration (s): 15.20 | learning rate: 2.158E-05 | global batch size: 16 | lm loss: 5.303745E+00 | grad norm: 0.693 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4118/ 128728 | consumed samples: 65888 | consumed tokens: 134938624 | elapsed time per iteration (s): 15.18 | learning rate: 2.159E-05 | global batch size: 16 | lm loss: 5.071564E+00 | grad norm: 0.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4119/ 128728 | consumed samples: 65904 | consumed tokens: 134971392 | elapsed time per iteration (s): 15.22 | learning rate: 2.160E-05 | global batch size: 16 | lm loss: 5.477110E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4120/ 128728 | consumed samples: 65920 | consumed tokens: 135004160 | elapsed time per iteration (s): 15.22 | learning rate: 2.160E-05 | global batch size: 16 | lm loss: 5.212381E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4121/ 128728 | consumed samples: 65936 | consumed tokens: 135036928 | elapsed time per iteration (s): 15.20 | learning rate: 2.161E-05 | global batch size: 16 | lm loss: 5.270615E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4122/ 128728 | consumed samples: 65952 | consumed tokens: 
135069696 | elapsed time per iteration (s): 15.25 | learning rate: 2.161E-05 | global batch size: 16 | lm loss: 5.196959E+00 | grad norm: 0.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4123/ 128728 | consumed samples: 65968 | consumed tokens: 135102464 | elapsed time per iteration (s): 15.23 | learning rate: 2.162E-05 | global batch size: 16 | lm loss: 5.202285E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4124/ 128728 | consumed samples: 65984 | consumed tokens: 135135232 | elapsed time per iteration (s): 15.24 | learning rate: 2.162E-05 | global batch size: 16 | lm loss: 5.252374E+00 | grad norm: 1.102 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4125/ 128728 | consumed samples: 66000 | consumed tokens: 135168000 | elapsed time per iteration (s): 15.22 | learning rate: 2.163E-05 | global batch size: 16 | lm loss: 5.523699E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4126/ 128728 | consumed samples: 66016 | consumed tokens: 135200768 | elapsed time per iteration (s): 15.25 | learning rate: 2.163E-05 | global batch size: 16 | lm loss: 5.421499E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4127/ 128728 | consumed samples: 66032 | consumed tokens: 135233536 | elapsed time per iteration (s): 15.23 | learning rate: 2.164E-05 | global batch size: 16 | lm loss: 5.364021E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.051 | TFLOPs: 8.04 | [default7]: iteration 4128/ 128728 | consumed samples: 66048 | consumed tokens: 135266304 | elapsed time per iteration (s): 15.28 | learning rate: 2.164E-05 | global batch size: 16 | lm loss: 5.282767E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4129/ 128728 | consumed samples: 66064 | consumed tokens: 135299072 | elapsed time per iteration (s): 15.23 | learning rate: 2.165E-05 | global batch size: 16 | lm loss: 5.315971E+00 | grad norm: 1.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4130/ 128728 | consumed samples: 66080 | consumed tokens: 135331840 | elapsed time per iteration (s): 15.22 | learning rate: 2.165E-05 | global batch size: 16 | lm loss: 5.305749E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4131/ 128728 | consumed samples: 66096 | consumed tokens: 135364608 | elapsed time per iteration (s): 15.24 | learning rate: 2.166E-05 | global batch size: 16 | lm loss: 5.339551E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4132/ 128728 | consumed samples: 66112 | consumed tokens: 135397376 | elapsed time per iteration (s): 15.16 | learning rate: 2.166E-05 | global batch size: 16 | lm loss: 5.253937E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4133/ 128728 | consumed samples: 66128 | consumed tokens: 135430144 | elapsed time per iteration (s): 15.22 | learning rate: 2.167E-05 | global batch size: 16 | lm loss: 5.494246E+00 | grad norm: 0.777 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4134/ 128728 | consumed samples: 66144 | consumed tokens: 135462912 | elapsed time per iteration (s): 15.21 | learning rate: 2.167E-05 | global batch size: 16 | lm loss: 5.367308E+00 | grad norm: 1.105 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4135/ 128728 | consumed samples: 66160 | consumed tokens: 135495680 | elapsed time per iteration (s): 15.21 | learning rate: 2.168E-05 | global batch size: 16 | lm loss: 5.511875E+00 | grad norm: 0.878 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4136/ 128728 | consumed samples: 66176 | consumed tokens: 135528448 | elapsed time per iteration (s): 15.13 | learning rate: 2.168E-05 | global batch size: 16 | lm loss: 5.383279E+00 | grad norm: 0.633 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.10 | [default7]: iteration 4137/ 128728 | consumed samples: 66192 | consumed tokens: 135561216 | elapsed time per iteration (s): 15.20 | learning rate: 2.169E-05 | global batch size: 16 | lm loss: 5.457438E+00 | grad norm: 0.905 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4138/ 128728 | consumed samples: 66208 | consumed tokens: 135593984 | elapsed time per iteration (s): 15.20 | learning rate: 2.170E-05 | global batch size: 16 | lm loss: 5.423141E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4139/ 128728 | consumed samples: 66224 | consumed tokens: 135626752 | elapsed time per iteration (s): 
15.18 | learning rate: 2.170E-05 | global batch size: 16 | lm loss: 5.250698E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4140/ 128728 | consumed samples: 66240 | consumed tokens: 135659520 | elapsed time per iteration (s): 15.23 | learning rate: 2.171E-05 | global batch size: 16 | lm loss: 5.458531E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4141/ 128728 | consumed samples: 66256 | consumed tokens: 135692288 | elapsed time per iteration (s): 15.22 | learning rate: 2.171E-05 | global batch size: 16 | lm loss: 5.207209E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4142/ 128728 | consumed samples: 66272 | consumed tokens: 135725056 | elapsed time per iteration (s): 15.23 | learning rate: 2.172E-05 | global batch size: 16 | lm loss: 5.151188E+00 | grad norm: 0.978 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4143/ 128728 | consumed samples: 66288 | consumed tokens: 135757824 | elapsed time per iteration (s): 15.15 | learning rate: 2.172E-05 | global batch size: 16 | lm loss: 5.418912E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4144/ 128728 | consumed samples: 66304 | consumed tokens: 135790592 | elapsed time per iteration (s): 15.24 | learning rate: 2.173E-05 | global batch size: 16 | lm loss: 5.352671E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 
4145/ 128728 | consumed samples: 66320 | consumed tokens: 135823360 | elapsed time per iteration (s): 15.23 | learning rate: 2.173E-05 | global batch size: 16 | lm loss: 5.245791E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4146/ 128728 | consumed samples: 66336 | consumed tokens: 135856128 | elapsed time per iteration (s): 15.17 | learning rate: 2.174E-05 | global batch size: 16 | lm loss: 5.323851E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4147/ 128728 | consumed samples: 66352 | consumed tokens: 135888896 | elapsed time per iteration (s): 15.23 | learning rate: 2.174E-05 | global batch size: 16 | lm loss: 5.287149E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4148/ 128728 | consumed samples: 66368 | consumed tokens: 135921664 | elapsed time per iteration (s): 15.22 | learning rate: 2.175E-05 | global batch size: 16 | lm loss: 5.266491E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4149/ 128728 | consumed samples: 66384 | consumed tokens: 135954432 | elapsed time per iteration (s): 15.23 | learning rate: 2.175E-05 | global batch size: 16 | lm loss: 5.090036E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4150/ 128728 | consumed samples: 66400 | consumed tokens: 135987200 | elapsed time per iteration (s): 15.24 | learning rate: 2.176E-05 | global batch size: 16 | lm loss: 5.160160E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4151/ 128728 | consumed samples: 66416 | consumed tokens: 136019968 | elapsed time per iteration (s): 15.23 | learning rate: 2.176E-05 | global batch size: 16 | lm loss: 5.263034E+00 | grad norm: 0.865 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4152/ 128728 | consumed samples: 66432 | consumed tokens: 136052736 | elapsed time per iteration (s): 15.20 | learning rate: 2.177E-05 | global batch size: 16 | lm loss: 5.461198E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4153/ 128728 | consumed samples: 66448 | consumed tokens: 136085504 | elapsed time per iteration (s): 15.22 | learning rate: 2.177E-05 | global batch size: 16 | lm loss: 5.331557E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4154/ 128728 | consumed samples: 66464 | consumed tokens: 136118272 | elapsed time per iteration (s): 15.21 | learning rate: 2.178E-05 | global batch size: 16 | lm loss: 5.365318E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4155/ 128728 | consumed samples: 66480 | consumed tokens: 136151040 | elapsed time per iteration (s): 15.23 | learning rate: 2.178E-05 | global batch size: 16 | lm loss: 5.274574E+00 | grad norm: 0.822 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4156/ 128728 | consumed samples: 66496 | consumed tokens: 136183808 | elapsed time per iteration (s): 15.19 | learning rate: 2.179E-05 | 
global batch size: 16 | lm loss: 5.311491E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4157/ 128728 | consumed samples: 66512 | consumed tokens: 136216576 | elapsed time per iteration (s): 15.20 | learning rate: 2.179E-05 | global batch size: 16 | lm loss: 5.222072E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4158/ 128728 | consumed samples: 66528 | consumed tokens: 136249344 | elapsed time per iteration (s): 15.24 | learning rate: 2.180E-05 | global batch size: 16 | lm loss: 5.269310E+00 | grad norm: 0.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4159/ 128728 | consumed samples: 66544 | consumed tokens: 136282112 | elapsed time per iteration (s): 15.25 | learning rate: 2.181E-05 | global batch size: 16 | lm loss: 5.600447E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4160/ 128728 | consumed samples: 66560 | consumed tokens: 136314880 | elapsed time per iteration (s): 15.20 | learning rate: 2.181E-05 | global batch size: 16 | lm loss: 5.225094E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4161/ 128728 | consumed samples: 66576 | consumed tokens: 136347648 | elapsed time per iteration (s): 15.25 | learning rate: 2.182E-05 | global batch size: 16 | lm loss: 5.211500E+00 | grad norm: 0.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4162/ 128728 | consumed samples: 
66592 | consumed tokens: 136380416 | elapsed time per iteration (s): 15.21 | learning rate: 2.182E-05 | global batch size: 16 | lm loss: 5.321060E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4163/ 128728 | consumed samples: 66608 | consumed tokens: 136413184 | elapsed time per iteration (s): 15.25 | learning rate: 2.183E-05 | global batch size: 16 | lm loss: 5.462878E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4164/ 128728 | consumed samples: 66624 | consumed tokens: 136445952 | elapsed time per iteration (s): 15.16 | learning rate: 2.183E-05 | global batch size: 16 | lm loss: 5.297573E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4165/ 128728 | consumed samples: 66640 | consumed tokens: 136478720 | elapsed time per iteration (s): 15.25 | learning rate: 2.184E-05 | global batch size: 16 | lm loss: 5.334221E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4166/ 128728 | consumed samples: 66656 | consumed tokens: 136511488 | elapsed time per iteration (s): 15.18 | learning rate: 2.184E-05 | global batch size: 16 | lm loss: 5.570589E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4167/ 128728 | consumed samples: 66672 | consumed tokens: 136544256 | elapsed time per iteration (s): 15.21 | learning rate: 2.185E-05 | global batch size: 16 | lm loss: 5.293012E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4168/ 128728 | consumed samples: 66688 | consumed tokens: 136577024 | elapsed time per iteration (s): 15.23 | learning rate: 2.185E-05 | global batch size: 16 | lm loss: 5.266202E+00 | grad norm: 0.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4169/ 128728 | consumed samples: 66704 | consumed tokens: 136609792 | elapsed time per iteration (s): 15.17 | learning rate: 2.186E-05 | global batch size: 16 | lm loss: 5.267851E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 4170/ 128728 | consumed samples: 66720 | consumed tokens: 136642560 | elapsed time per iteration (s): 15.17 | learning rate: 2.186E-05 | global batch size: 16 | lm loss: 5.526597E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 4171/ 128728 | consumed samples: 66736 | consumed tokens: 136675328 | elapsed time per iteration (s): 15.16 | learning rate: 2.187E-05 | global batch size: 16 | lm loss: 5.368105E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4172/ 128728 | consumed samples: 66752 | consumed tokens: 136708096 | elapsed time per iteration (s): 15.18 | learning rate: 2.187E-05 | global batch size: 16 | lm loss: 5.369236E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4173/ 128728 | consumed samples: 66768 | consumed tokens: 136740864 | elapsed time per iteration (s): 15.16 | learning rate: 2.188E-05 | global batch size: 16 | lm loss: 5.310276E+00 
| grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4174/ 128728 | consumed samples: 66784 | consumed tokens: 136773632 | elapsed time per iteration (s): 15.20 | learning rate: 2.188E-05 | global batch size: 16 | lm loss: 5.394924E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4175/ 128728 | consumed samples: 66800 | consumed tokens: 136806400 | elapsed time per iteration (s): 15.20 | learning rate: 2.189E-05 | global batch size: 16 | lm loss: 5.482425E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4176/ 128728 | consumed samples: 66816 | consumed tokens: 136839168 | elapsed time per iteration (s): 15.26 | learning rate: 2.189E-05 | global batch size: 16 | lm loss: 5.236983E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4177/ 128728 | consumed samples: 66832 | consumed tokens: 136871936 | elapsed time per iteration (s): 15.21 | learning rate: 2.190E-05 | global batch size: 16 | lm loss: 5.265677E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4178/ 128728 | consumed samples: 66848 | consumed tokens: 136904704 | elapsed time per iteration (s): 15.27 | learning rate: 2.190E-05 | global batch size: 16 | lm loss: 5.455791E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4179/ 128728 | consumed samples: 66864 | consumed tokens: 136937472 | elapsed time 
per iteration (s): 15.19 | learning rate: 2.191E-05 | global batch size: 16 | lm loss: 5.398285E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4180/ 128728 | consumed samples: 66880 | consumed tokens: 136970240 | elapsed time per iteration (s): 15.27 | learning rate: 2.192E-05 | global batch size: 16 | lm loss: 5.405532E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4181/ 128728 | consumed samples: 66896 | consumed tokens: 137003008 | elapsed time per iteration (s): 15.19 | learning rate: 2.192E-05 | global batch size: 16 | lm loss: 5.397069E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4182/ 128728 | consumed samples: 66912 | consumed tokens: 137035776 | elapsed time per iteration (s): 15.15 | learning rate: 2.193E-05 | global batch size: 16 | lm loss: 5.299715E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4183/ 128728 | consumed samples: 66928 | consumed tokens: 137068544 | elapsed time per iteration (s): 15.15 | learning rate: 2.193E-05 | global batch size: 16 | lm loss: 5.255721E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4184/ 128728 | consumed samples: 66944 | consumed tokens: 137101312 | elapsed time per iteration (s): 15.17 | learning rate: 2.194E-05 | global batch size: 16 | lm loss: 5.327215E+00 | grad norm: 0.948 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | 
[default7]: iteration 4185/ 128728 | consumed samples: 66960 | consumed tokens: 137134080 | elapsed time per iteration (s): 15.22 | learning rate: 2.194E-05 | global batch size: 16 | lm loss: 5.332559E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4186/ 128728 | consumed samples: 66976 | consumed tokens: 137166848 | elapsed time per iteration (s): 15.21 | learning rate: 2.195E-05 | global batch size: 16 | lm loss: 5.161037E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4187/ 128728 | consumed samples: 66992 | consumed tokens: 137199616 | elapsed time per iteration (s): 15.17 | learning rate: 2.195E-05 | global batch size: 16 | lm loss: 5.237501E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4188/ 128728 | consumed samples: 67008 | consumed tokens: 137232384 | elapsed time per iteration (s): 15.19 | learning rate: 2.196E-05 | global batch size: 16 | lm loss: 5.113091E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4189/ 128728 | consumed samples: 67024 | consumed tokens: 137265152 | elapsed time per iteration (s): 15.19 | learning rate: 2.196E-05 | global batch size: 16 | lm loss: 5.165911E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4190/ 128728 | consumed samples: 67040 | consumed tokens: 137297920 | elapsed time per iteration (s): 15.15 | learning rate: 2.197E-05 | global batch size: 16 | lm loss: 5.443858E+00 | grad norm: 0.696 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4191/ 128728 | consumed samples: 67056 | consumed tokens: 137330688 | elapsed time per iteration (s): 15.15 | learning rate: 2.197E-05 | global batch size: 16 | lm loss: 5.206475E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4192/ 128728 | consumed samples: 67072 | consumed tokens: 137363456 | elapsed time per iteration (s): 15.18 | learning rate: 2.198E-05 | global batch size: 16 | lm loss: 5.317861E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4193/ 128728 | consumed samples: 67088 | consumed tokens: 137396224 | elapsed time per iteration (s): 15.21 | learning rate: 2.198E-05 | global batch size: 16 | lm loss: 5.115374E+00 | grad norm: 0.814 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4194/ 128728 | consumed samples: 67104 | consumed tokens: 137428992 | elapsed time per iteration (s): 15.23 | learning rate: 2.199E-05 | global batch size: 16 | lm loss: 5.261423E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4195/ 128728 | consumed samples: 67120 | consumed tokens: 137461760 | elapsed time per iteration (s): 15.24 | learning rate: 2.199E-05 | global batch size: 16 | lm loss: 5.248822E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4196/ 128728 | consumed samples: 67136 | consumed tokens: 137494528 | elapsed time per iteration (s): 15.18 | learning rate: 
2.200E-05 | global batch size: 16 | lm loss: 5.515530E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4197/ 128728 | consumed samples: 67152 | consumed tokens: 137527296 | elapsed time per iteration (s): 15.18 | learning rate: 2.200E-05 | global batch size: 16 | lm loss: 5.335524E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4198/ 128728 | consumed samples: 67168 | consumed tokens: 137560064 | elapsed time per iteration (s): 15.23 | learning rate: 2.201E-05 | global batch size: 16 | lm loss: 5.289407E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4199/ 128728 | consumed samples: 67184 | consumed tokens: 137592832 | elapsed time per iteration (s): 15.24 | learning rate: 2.201E-05 | global batch size: 16 | lm loss: 5.457632E+00 | grad norm: 1.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4200/ 128728 | consumed samples: 67200 | consumed tokens: 137625600 | elapsed time per iteration (s): 15.24 | learning rate: 2.202E-05 | global batch size: 16 | lm loss: 5.170599E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4201/ 128728 | consumed samples: 67216 | consumed tokens: 137658368 | elapsed time per iteration (s): 15.21 | learning rate: 2.203E-05 | global batch size: 16 | lm loss: 5.229961E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4202/ 128728 | consumed 
samples: 67232 | consumed tokens: 137691136 | elapsed time per iteration (s): 15.28 | learning rate: 2.203E-05 | global batch size: 16 | lm loss: 5.323138E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4203/ 128728 | consumed samples: 67248 | consumed tokens: 137723904 | elapsed time per iteration (s): 15.25 | learning rate: 2.204E-05 | global batch size: 16 | lm loss: 5.334191E+00 | grad norm: 2.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4204/ 128728 | consumed samples: 67264 | consumed tokens: 137756672 | elapsed time per iteration (s): 15.20 | learning rate: 2.204E-05 | global batch size: 16 | lm loss: 5.436996E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4205/ 128728 | consumed samples: 67280 | consumed tokens: 137789440 | elapsed time per iteration (s): 15.19 | learning rate: 2.205E-05 | global batch size: 16 | lm loss: 5.285421E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4206/ 128728 | consumed samples: 67296 | consumed tokens: 137822208 | elapsed time per iteration (s): 15.16 | learning rate: 2.205E-05 | global batch size: 16 | lm loss: 5.376272E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4207/ 128728 | consumed samples: 67312 | consumed tokens: 137854976 | elapsed time per iteration (s): 15.25 | learning rate: 2.206E-05 | global batch size: 16 | lm loss: 5.097405E+00 | grad norm: 0.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4208/ 128728 | consumed samples: 67328 | consumed tokens: 137887744 | elapsed time per iteration (s): 15.19 | learning rate: 2.206E-05 | global batch size: 16 | lm loss: 5.426728E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4209/ 128728 | consumed samples: 67344 | consumed tokens: 137920512 | elapsed time per iteration (s): 15.24 | learning rate: 2.207E-05 | global batch size: 16 | lm loss: 5.375102E+00 | grad norm: 0.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4210/ 128728 | consumed samples: 67360 | consumed tokens: 137953280 | elapsed time per iteration (s): 15.21 | learning rate: 2.207E-05 | global batch size: 16 | lm loss: 5.303322E+00 | grad norm: 0.826 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4211/ 128728 | consumed samples: 67376 | consumed tokens: 137986048 | elapsed time per iteration (s): 15.26 | learning rate: 2.208E-05 | global batch size: 16 | lm loss: 5.251130E+00 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4212/ 128728 | consumed samples: 67392 | consumed tokens: 138018816 | elapsed time per iteration (s): 15.24 | learning rate: 2.208E-05 | global batch size: 16 | lm loss: 5.580229E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4213/ 128728 | consumed samples: 67408 | consumed tokens: 138051584 | elapsed time per iteration (s): 15.23 | learning rate: 2.209E-05 | global batch size: 16 | lm 
loss: 5.441099E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4214/ 128728 | consumed samples: 67424 | consumed tokens: 138084352 | elapsed time per iteration (s): 15.25 | learning rate: 2.209E-05 | global batch size: 16 | lm loss: 5.526458E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4215/ 128728 | consumed samples: 67440 | consumed tokens: 138117120 | elapsed time per iteration (s): 15.19 | learning rate: 2.210E-05 | global batch size: 16 | lm loss: 5.415507E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4216/ 128728 | consumed samples: 67456 | consumed tokens: 138149888 | elapsed time per iteration (s): 15.28 | learning rate: 2.210E-05 | global batch size: 16 | lm loss: 5.300536E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4217/ 128728 | consumed samples: 67472 | consumed tokens: 138182656 | elapsed time per iteration (s): 15.24 | learning rate: 2.211E-05 | global batch size: 16 | lm loss: 5.354405E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4218/ 128728 | consumed samples: 67488 | consumed tokens: 138215424 | elapsed time per iteration (s): 15.22 | learning rate: 2.211E-05 | global batch size: 16 | lm loss: 5.247156E+00 | grad norm: 0.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4219/ 128728 | consumed samples: 67504 | consumed tokens: 
138248192 | elapsed time per iteration (s): 15.21 | learning rate: 2.212E-05 | global batch size: 16 | lm loss: 5.200278E+00 | grad norm: 0.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4220/ 128728 | consumed samples: 67520 | consumed tokens: 138280960 | elapsed time per iteration (s): 15.21 | learning rate: 2.213E-05 | global batch size: 16 | lm loss: 5.260693E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4221/ 128728 | consumed samples: 67536 | consumed tokens: 138313728 | elapsed time per iteration (s): 15.22 | learning rate: 2.213E-05 | global batch size: 16 | lm loss: 5.003216E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4222/ 128728 | consumed samples: 67552 | consumed tokens: 138346496 | elapsed time per iteration (s): 15.21 | learning rate: 2.214E-05 | global batch size: 16 | lm loss: 5.429131E+00 | grad norm: 1.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4223/ 128728 | consumed samples: 67568 | consumed tokens: 138379264 | elapsed time per iteration (s): 15.22 | learning rate: 2.214E-05 | global batch size: 16 | lm loss: 5.260954E+00 | grad norm: 0.699 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4224/ 128728 | consumed samples: 67584 | consumed tokens: 138412032 | elapsed time per iteration (s): 15.20 | learning rate: 2.215E-05 | global batch size: 16 | lm loss: 5.218945E+00 | grad norm: 3.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.053 | TFLOPs: 8.06 | [default7]: iteration 4225/ 128728 | consumed samples: 67600 | consumed tokens: 138444800 | elapsed time per iteration (s): 15.25 | learning rate: 2.215E-05 | global batch size: 16 | lm loss: 5.612597E+00 | grad norm: 1.303 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4226/ 128728 | consumed samples: 67616 | consumed tokens: 138477568 | elapsed time per iteration (s): 15.22 | learning rate: 2.216E-05 | global batch size: 16 | lm loss: 5.233457E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4227/ 128728 | consumed samples: 67632 | consumed tokens: 138510336 | elapsed time per iteration (s): 15.20 | learning rate: 2.216E-05 | global batch size: 16 | lm loss: 5.089907E+00 | grad norm: 0.970 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4228/ 128728 | consumed samples: 67648 | consumed tokens: 138543104 | elapsed time per iteration (s): 15.25 | learning rate: 2.217E-05 | global batch size: 16 | lm loss: 5.520075E+00 | grad norm: 0.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4229/ 128728 | consumed samples: 67664 | consumed tokens: 138575872 | elapsed time per iteration (s): 15.19 | learning rate: 2.217E-05 | global batch size: 16 | lm loss: 5.053356E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4230/ 128728 | consumed samples: 67680 | consumed tokens: 138608640 | elapsed time per iteration (s): 15.24 | learning rate: 2.218E-05 | global batch size: 16 | lm loss: 5.406147E+00 | grad norm: 0.787 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4231/ 128728 | consumed samples: 67696 | consumed tokens: 138641408 | elapsed time per iteration (s): 15.20 | learning rate: 2.218E-05 | global batch size: 16 | lm loss: 5.376842E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4232/ 128728 | consumed samples: 67712 | consumed tokens: 138674176 | elapsed time per iteration (s): 15.21 | learning rate: 2.219E-05 | global batch size: 16 | lm loss: 5.179604E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4233/ 128728 | consumed samples: 67728 | consumed tokens: 138706944 | elapsed time per iteration (s): 15.24 | learning rate: 2.219E-05 | global batch size: 16 | lm loss: 5.316276E+00 | grad norm: 1.087 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4234/ 128728 | consumed samples: 67744 | consumed tokens: 138739712 | elapsed time per iteration (s): 15.20 | learning rate: 2.220E-05 | global batch size: 16 | lm loss: 5.243623E+00 | grad norm: 1.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4235/ 128728 | consumed samples: 67760 | consumed tokens: 138772480 | elapsed time per iteration (s): 15.23 | learning rate: 2.220E-05 | global batch size: 16 | lm loss: 5.085675E+00 | grad norm: 1.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4236/ 128728 | consumed samples: 67776 | consumed tokens: 138805248 | elapsed time per iteration (s): 
15.24 | learning rate: 2.221E-05 | global batch size: 16 | lm loss: 5.159794E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4237/ 128728 | consumed samples: 67792 | consumed tokens: 138838016 | elapsed time per iteration (s): 15.17 | learning rate: 2.221E-05 | global batch size: 16 | lm loss: 5.064829E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4238/ 128728 | consumed samples: 67808 | consumed tokens: 138870784 | elapsed time per iteration (s): 15.22 | learning rate: 2.222E-05 | global batch size: 16 | lm loss: 5.373168E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4239/ 128728 | consumed samples: 67824 | consumed tokens: 138903552 | elapsed time per iteration (s): 15.25 | learning rate: 2.222E-05 | global batch size: 16 | lm loss: 5.072435E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4240/ 128728 | consumed samples: 67840 | consumed tokens: 138936320 | elapsed time per iteration (s): 15.23 | learning rate: 2.223E-05 | global batch size: 16 | lm loss: 5.378523E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4241/ 128728 | consumed samples: 67856 | consumed tokens: 138969088 | elapsed time per iteration (s): 15.24 | learning rate: 2.224E-05 | global batch size: 16 | lm loss: 5.313819E+00 | grad norm: 0.980 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 
4242/ 128728 | consumed samples: 67872 | consumed tokens: 139001856 | elapsed time per iteration (s): 15.22 | learning rate: 2.224E-05 | global batch size: 16 | lm loss: 5.239834E+00 | grad norm: 0.995 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4243/ 128728 | consumed samples: 67888 | consumed tokens: 139034624 | elapsed time per iteration (s): 15.25 | learning rate: 2.225E-05 | global batch size: 16 | lm loss: 5.422865E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4244/ 128728 | consumed samples: 67904 | consumed tokens: 139067392 | elapsed time per iteration (s): 15.15 | learning rate: 2.225E-05 | global batch size: 16 | lm loss: 5.566054E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4245/ 128728 | consumed samples: 67920 | consumed tokens: 139100160 | elapsed time per iteration (s): 15.15 | learning rate: 2.226E-05 | global batch size: 16 | lm loss: 5.035063E+00 | grad norm: 0.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4246/ 128728 | consumed samples: 67936 | consumed tokens: 139132928 | elapsed time per iteration (s): 15.27 | learning rate: 2.226E-05 | global batch size: 16 | lm loss: 5.305432E+00 | grad norm: 0.907 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4247/ 128728 | consumed samples: 67952 | consumed tokens: 139165696 | elapsed time per iteration (s): 15.17 | learning rate: 2.227E-05 | global batch size: 16 | lm loss: 5.268905E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 4248/ 128728 | consumed samples: 67968 | consumed tokens: 139198464 | elapsed time per iteration (s): 15.17 | learning rate: 2.227E-05 | global batch size: 16 | lm loss: 5.425210E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4249/ 128728 | consumed samples: 67984 | consumed tokens: 139231232 | elapsed time per iteration (s): 15.18 | learning rate: 2.228E-05 | global batch size: 16 | lm loss: 5.495585E+00 | grad norm: 0.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4250/ 128728 | consumed samples: 68000 | consumed tokens: 139264000 | elapsed time per iteration (s): 15.15 | learning rate: 2.228E-05 | global batch size: 16 | lm loss: 5.291341E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4251/ 128728 | consumed samples: 68016 | consumed tokens: 139296768 | elapsed time per iteration (s): 15.18 | learning rate: 2.229E-05 | global batch size: 16 | lm loss: 5.175395E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4252/ 128728 | consumed samples: 68032 | consumed tokens: 139329536 | elapsed time per iteration (s): 15.22 | learning rate: 2.229E-05 | global batch size: 16 | lm loss: 5.436680E+00 | grad norm: 0.869 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4253/ 128728 | consumed samples: 68048 | consumed tokens: 139362304 | elapsed time per iteration (s): 15.21 | learning rate: 2.230E-05 | 
global batch size: 16 | lm loss: 5.096869E+00 | grad norm: 2.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4254/ 128728 | consumed samples: 68064 | consumed tokens: 139395072 | elapsed time per iteration (s): 15.25 | learning rate: 2.230E-05 | global batch size: 16 | lm loss: 5.111172E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4255/ 128728 | consumed samples: 68080 | consumed tokens: 139427840 | elapsed time per iteration (s): 15.22 | learning rate: 2.231E-05 | global batch size: 16 | lm loss: 5.014842E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4256/ 128728 | consumed samples: 68096 | consumed tokens: 139460608 | elapsed time per iteration (s): 15.17 | learning rate: 2.231E-05 | global batch size: 16 | lm loss: 5.302151E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4257/ 128728 | consumed samples: 68112 | consumed tokens: 139493376 | elapsed time per iteration (s): 15.18 | learning rate: 2.232E-05 | global batch size: 16 | lm loss: 5.351344E+00 | grad norm: 0.842 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4258/ 128728 | consumed samples: 68128 | consumed tokens: 139526144 | elapsed time per iteration (s): 15.23 | learning rate: 2.232E-05 | global batch size: 16 | lm loss: 5.253459E+00 | grad norm: 0.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4259/ 128728 | consumed samples: 
68144 | consumed tokens: 139558912 | elapsed time per iteration (s): 15.15 | learning rate: 2.233E-05 | global batch size: 16 | lm loss: 5.244567E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4260/ 128728 | consumed samples: 68160 | consumed tokens: 139591680 | elapsed time per iteration (s): 15.25 | learning rate: 2.233E-05 | global batch size: 16 | lm loss: 5.337202E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4261/ 128728 | consumed samples: 68176 | consumed tokens: 139624448 | elapsed time per iteration (s): 15.20 | learning rate: 2.234E-05 | global batch size: 16 | lm loss: 5.356158E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4262/ 128728 | consumed samples: 68192 | consumed tokens: 139657216 | elapsed time per iteration (s): 15.25 | learning rate: 2.235E-05 | global batch size: 16 | lm loss: 5.314350E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4263/ 128728 | consumed samples: 68208 | consumed tokens: 139689984 | elapsed time per iteration (s): 15.23 | learning rate: 2.235E-05 | global batch size: 16 | lm loss: 5.277968E+00 | grad norm: 0.867 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4264/ 128728 | consumed samples: 68224 | consumed tokens: 139722752 | elapsed time per iteration (s): 15.19 | learning rate: 2.236E-05 | global batch size: 16 | lm loss: 5.386879E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4265/ 128728 | consumed samples: 68240 | consumed tokens: 139755520 | elapsed time per iteration (s): 15.19 | learning rate: 2.236E-05 | global batch size: 16 | lm loss: 5.298102E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4266/ 128728 | consumed samples: 68256 | consumed tokens: 139788288 | elapsed time per iteration (s): 15.20 | learning rate: 2.237E-05 | global batch size: 16 | lm loss: 5.063458E+00 | grad norm: 1.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4267/ 128728 | consumed samples: 68272 | consumed tokens: 139821056 | elapsed time per iteration (s): 15.22 | learning rate: 2.237E-05 | global batch size: 16 | lm loss: 5.150900E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4268/ 128728 | consumed samples: 68288 | consumed tokens: 139853824 | elapsed time per iteration (s): 15.24 | learning rate: 2.238E-05 | global batch size: 16 | lm loss: 5.480645E+00 | grad norm: 1.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4269/ 128728 | consumed samples: 68304 | consumed tokens: 139886592 | elapsed time per iteration (s): 15.21 | learning rate: 2.238E-05 | global batch size: 16 | lm loss: 5.393959E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4270/ 128728 | consumed samples: 68320 | consumed tokens: 139919360 | elapsed time per iteration (s): 15.21 | learning rate: 2.239E-05 | global batch size: 16 | lm loss: 5.195272E+00 
| grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4271/ 128728 | consumed samples: 68336 | consumed tokens: 139952128 | elapsed time per iteration (s): 15.22 | learning rate: 2.239E-05 | global batch size: 16 | lm loss: 5.329949E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4272/ 128728 | consumed samples: 68352 | consumed tokens: 139984896 | elapsed time per iteration (s): 15.20 | learning rate: 2.240E-05 | global batch size: 16 | lm loss: 5.188565E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4273/ 128728 | consumed samples: 68368 | consumed tokens: 140017664 | elapsed time per iteration (s): 15.23 | learning rate: 2.240E-05 | global batch size: 16 | lm loss: 5.395569E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4274/ 128728 | consumed samples: 68384 | consumed tokens: 140050432 | elapsed time per iteration (s): 15.23 | learning rate: 2.241E-05 | global batch size: 16 | lm loss: 5.257808E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4275/ 128728 | consumed samples: 68400 | consumed tokens: 140083200 | elapsed time per iteration (s): 15.23 | learning rate: 2.241E-05 | global batch size: 16 | lm loss: 5.396634E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4276/ 128728 | consumed samples: 68416 | consumed tokens: 140115968 | elapsed time 
per iteration (s): 15.20 | learning rate: 2.242E-05 | global batch size: 16 | lm loss: 5.054380E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4277/ 128728 | consumed samples: 68432 | consumed tokens: 140148736 | elapsed time per iteration (s): 15.23 | learning rate: 2.242E-05 | global batch size: 16 | lm loss: 5.394772E+00 | grad norm: 0.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4278/ 128728 | consumed samples: 68448 | consumed tokens: 140181504 | elapsed time per iteration (s): 15.19 | learning rate: 2.243E-05 | global batch size: 16 | lm loss: 5.329741E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4279/ 128728 | consumed samples: 68464 | consumed tokens: 140214272 | elapsed time per iteration (s): 15.23 | learning rate: 2.243E-05 | global batch size: 16 | lm loss: 5.123846E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4280/ 128728 | consumed samples: 68480 | consumed tokens: 140247040 | elapsed time per iteration (s): 15.22 | learning rate: 2.244E-05 | global batch size: 16 | lm loss: 5.114894E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4281/ 128728 | consumed samples: 68496 | consumed tokens: 140279808 | elapsed time per iteration (s): 15.24 | learning rate: 2.244E-05 | global batch size: 16 | lm loss: 5.329511E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | 
[default7]: iteration 4282/ 128728 | consumed samples: 68512 | consumed tokens: 140312576 | elapsed time per iteration (s): 15.23 | learning rate: 2.245E-05 | global batch size: 16 | lm loss: 5.257269E+00 | grad norm: 0.955 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4283/ 128728 | consumed samples: 68528 | consumed tokens: 140345344 | elapsed time per iteration (s): 15.22 | learning rate: 2.246E-05 | global batch size: 16 | lm loss: 5.189564E+00 | grad norm: 0.929 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4284/ 128728 | consumed samples: 68544 | consumed tokens: 140378112 | elapsed time per iteration (s): 15.19 | learning rate: 2.246E-05 | global batch size: 16 | lm loss: 5.470556E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4285/ 128728 | consumed samples: 68560 | consumed tokens: 140410880 | elapsed time per iteration (s): 15.21 | learning rate: 2.247E-05 | global batch size: 16 | lm loss: 5.387702E+00 | grad norm: 1.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4286/ 128728 | consumed samples: 68576 | consumed tokens: 140443648 | elapsed time per iteration (s): 15.22 | learning rate: 2.247E-05 | global batch size: 16 | lm loss: 5.492844E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4287/ 128728 | consumed samples: 68592 | consumed tokens: 140476416 | elapsed time per iteration (s): 15.22 | learning rate: 2.248E-05 | global batch size: 16 | lm loss: 5.419727E+00 | grad norm: 0.911 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4288/ 128728 | consumed samples: 68608 | consumed tokens: 140509184 | elapsed time per iteration (s): 15.22 | learning rate: 2.248E-05 | global batch size: 16 | lm loss: 5.376180E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4289/ 128728 | consumed samples: 68624 | consumed tokens: 140541952 | elapsed time per iteration (s): 15.21 | learning rate: 2.249E-05 | global batch size: 16 | lm loss: 5.231359E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4290/ 128728 | consumed samples: 68640 | consumed tokens: 140574720 | elapsed time per iteration (s): 15.20 | learning rate: 2.249E-05 | global batch size: 16 | lm loss: 5.340625E+00 | grad norm: 0.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4291/ 128728 | consumed samples: 68656 | consumed tokens: 140607488 | elapsed time per iteration (s): 15.14 | learning rate: 2.250E-05 | global batch size: 16 | lm loss: 5.693937E+00 | grad norm: 1.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 4292/ 128728 | consumed samples: 68672 | consumed tokens: 140640256 | elapsed time per iteration (s): 15.21 | learning rate: 2.250E-05 | global batch size: 16 | lm loss: 5.231561E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4293/ 128728 | consumed samples: 68688 | consumed tokens: 140673024 | elapsed time per iteration (s): 15.24 | learning rate: 
2.251E-05 | global batch size: 16 | lm loss: 5.350264E+00 | grad norm: 1.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4294/ 128728 | consumed samples: 68704 | consumed tokens: 140705792 | elapsed time per iteration (s): 15.21 | learning rate: 2.251E-05 | global batch size: 16 | lm loss: 5.243148E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4295/ 128728 | consumed samples: 68720 | consumed tokens: 140738560 | elapsed time per iteration (s): 15.24 | learning rate: 2.252E-05 | global batch size: 16 | lm loss: 5.305950E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4296/ 128728 | consumed samples: 68736 | consumed tokens: 140771328 | elapsed time per iteration (s): 15.20 | learning rate: 2.252E-05 | global batch size: 16 | lm loss: 5.412365E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4297/ 128728 | consumed samples: 68752 | consumed tokens: 140804096 | elapsed time per iteration (s): 15.21 | learning rate: 2.253E-05 | global batch size: 16 | lm loss: 5.151298E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4298/ 128728 | consumed samples: 68768 | consumed tokens: 140836864 | elapsed time per iteration (s): 15.16 | learning rate: 2.253E-05 | global batch size: 16 | lm loss: 5.339790E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4299/ 128728 | consumed 
samples: 68784 | consumed tokens: 140869632 | elapsed time per iteration (s): 15.22 | learning rate: 2.254E-05 | global batch size: 16 | lm loss: 5.416695E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4300/ 128728 | consumed samples: 68800 | consumed tokens: 140902400 | elapsed time per iteration (s): 15.21 | learning rate: 2.254E-05 | global batch size: 16 | lm loss: 5.202350E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4301/ 128728 | consumed samples: 68816 | consumed tokens: 140935168 | elapsed time per iteration (s): 15.20 | learning rate: 2.255E-05 | global batch size: 16 | lm loss: 5.075429E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4302/ 128728 | consumed samples: 68832 | consumed tokens: 140967936 | elapsed time per iteration (s): 15.18 | learning rate: 2.255E-05 | global batch size: 16 | lm loss: 5.435070E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4303/ 128728 | consumed samples: 68848 | consumed tokens: 141000704 | elapsed time per iteration (s): 15.22 | learning rate: 2.256E-05 | global batch size: 16 | lm loss: 5.356237E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4304/ 128728 | consumed samples: 68864 | consumed tokens: 141033472 | elapsed time per iteration (s): 15.23 | learning rate: 2.257E-05 | global batch size: 16 | lm loss: 5.306829E+00 | grad norm: 0.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4305/ 128728 | consumed samples: 68880 | consumed tokens: 141066240 | elapsed time per iteration (s): 15.22 | learning rate: 2.257E-05 | global batch size: 16 | lm loss: 5.368681E+00 | grad norm: 1.050 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4306/ 128728 | consumed samples: 68896 | consumed tokens: 141099008 | elapsed time per iteration (s): 15.25 | learning rate: 2.258E-05 | global batch size: 16 | lm loss: 5.147976E+00 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 4307/ 128728 | consumed samples: 68912 | consumed tokens: 141131776 | elapsed time per iteration (s): 15.22 | learning rate: 2.258E-05 | global batch size: 16 | lm loss: 5.660544E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4308/ 128728 | consumed samples: 68928 | consumed tokens: 141164544 | elapsed time per iteration (s): 15.18 | learning rate: 2.259E-05 | global batch size: 16 | lm loss: 5.237420E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4309/ 128728 | consumed samples: 68944 | consumed tokens: 141197312 | elapsed time per iteration (s): 15.21 | learning rate: 2.259E-05 | global batch size: 16 | lm loss: 5.274828E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4310/ 128728 | consumed samples: 68960 | consumed tokens: 141230080 | elapsed time per iteration (s): 15.22 | learning rate: 2.260E-05 | global batch size: 16 | lm 
loss: 5.353731E+00 | grad norm: 0.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4311/ 128728 | consumed samples: 68976 | consumed tokens: 141262848 | elapsed time per iteration (s): 15.20 | learning rate: 2.260E-05 | global batch size: 16 | lm loss: 5.093883E+00 | grad norm: 0.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4312/ 128728 | consumed samples: 68992 | consumed tokens: 141295616 | elapsed time per iteration (s): 15.22 | learning rate: 2.261E-05 | global batch size: 16 | lm loss: 5.287315E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4313/ 128728 | consumed samples: 69008 | consumed tokens: 141328384 | elapsed time per iteration (s): 15.17 | learning rate: 2.261E-05 | global batch size: 16 | lm loss: 5.389223E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4314/ 128728 | consumed samples: 69024 | consumed tokens: 141361152 | elapsed time per iteration (s): 15.20 | learning rate: 2.262E-05 | global batch size: 16 | lm loss: 5.299072E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4315/ 128728 | consumed samples: 69040 | consumed tokens: 141393920 | elapsed time per iteration (s): 15.23 | learning rate: 2.262E-05 | global batch size: 16 | lm loss: 5.455530E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4316/ 128728 | consumed samples: 69056 | consumed tokens: 
141426688 | elapsed time per iteration (s): 15.24 | learning rate: 2.263E-05 | global batch size: 16 | lm loss: 5.203159E+00 | grad norm: 0.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4317/ 128728 | consumed samples: 69072 | consumed tokens: 141459456 | elapsed time per iteration (s): 15.21 | learning rate: 2.263E-05 | global batch size: 16 | lm loss: 5.403166E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4318/ 128728 | consumed samples: 69088 | consumed tokens: 141492224 | elapsed time per iteration (s): 15.28 | learning rate: 2.264E-05 | global batch size: 16 | lm loss: 5.332869E+00 | grad norm: 0.726 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4319/ 128728 | consumed samples: 69104 | consumed tokens: 141524992 | elapsed time per iteration (s): 15.24 | learning rate: 2.264E-05 | global batch size: 16 | lm loss: 5.213949E+00 | grad norm: 1.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4320/ 128728 | consumed samples: 69120 | consumed tokens: 141557760 | elapsed time per iteration (s): 15.28 | learning rate: 2.265E-05 | global batch size: 16 | lm loss: 5.245769E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4321/ 128728 | consumed samples: 69136 | consumed tokens: 141590528 | elapsed time per iteration (s): 15.23 | learning rate: 2.265E-05 | global batch size: 16 | lm loss: 5.118613E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.051 | TFLOPs: 8.04 | [default7]: iteration 4322/ 128728 | consumed samples: 69152 | consumed tokens: 141623296 | elapsed time per iteration (s): 15.14 | learning rate: 2.266E-05 | global batch size: 16 | lm loss: 5.262797E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 4323/ 128728 | consumed samples: 69168 | consumed tokens: 141656064 | elapsed time per iteration (s): 15.15 | learning rate: 2.267E-05 | global batch size: 16 | lm loss: 5.311221E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4324/ 128728 | consumed samples: 69184 | consumed tokens: 141688832 | elapsed time per iteration (s): 15.21 | learning rate: 2.267E-05 | global batch size: 16 | lm loss: 5.268604E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4325/ 128728 | consumed samples: 69200 | consumed tokens: 141721600 | elapsed time per iteration (s): 15.20 | learning rate: 2.268E-05 | global batch size: 16 | lm loss: 5.282369E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4326/ 128728 | consumed samples: 69216 | consumed tokens: 141754368 | elapsed time per iteration (s): 15.23 | learning rate: 2.268E-05 | global batch size: 16 | lm loss: 5.319102E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4327/ 128728 | consumed samples: 69232 | consumed tokens: 141787136 | elapsed time per iteration (s): 15.25 | learning rate: 2.269E-05 | global batch size: 16 | lm loss: 5.178657E+00 | grad norm: 0.922 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4328/ 128728 | consumed samples: 69248 | consumed tokens: 141819904 | elapsed time per iteration (s): 15.22 | learning rate: 2.269E-05 | global batch size: 16 | lm loss: 5.155839E+00 | grad norm: 0.884 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4329/ 128728 | consumed samples: 69264 | consumed tokens: 141852672 | elapsed time per iteration (s): 15.23 | learning rate: 2.270E-05 | global batch size: 16 | lm loss: 5.336724E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4330/ 128728 | consumed samples: 69280 | consumed tokens: 141885440 | elapsed time per iteration (s): 15.26 | learning rate: 2.270E-05 | global batch size: 16 | lm loss: 5.267392E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4331/ 128728 | consumed samples: 69296 | consumed tokens: 141918208 | elapsed time per iteration (s): 15.24 | learning rate: 2.271E-05 | global batch size: 16 | lm loss: 5.210427E+00 | grad norm: 0.890 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4332/ 128728 | consumed samples: 69312 | consumed tokens: 141950976 | elapsed time per iteration (s): 15.20 | learning rate: 2.271E-05 | global batch size: 16 | lm loss: 5.287461E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4333/ 128728 | consumed samples: 69328 | consumed tokens: 141983744 | elapsed time per iteration (s): 
15.22 | learning rate: 2.272E-05 | global batch size: 16 | lm loss: 5.281710E+00 | grad norm: 0.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4334/ 128728 | consumed samples: 69344 | consumed tokens: 142016512 | elapsed time per iteration (s): 15.20 | learning rate: 2.272E-05 | global batch size: 16 | lm loss: 5.312432E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4335/ 128728 | consumed samples: 69360 | consumed tokens: 142049280 | elapsed time per iteration (s): 15.22 | learning rate: 2.273E-05 | global batch size: 16 | lm loss: 5.456483E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4336/ 128728 | consumed samples: 69376 | consumed tokens: 142082048 | elapsed time per iteration (s): 15.20 | learning rate: 2.273E-05 | global batch size: 16 | lm loss: 4.999959E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4337/ 128728 | consumed samples: 69392 | consumed tokens: 142114816 | elapsed time per iteration (s): 15.25 | learning rate: 2.274E-05 | global batch size: 16 | lm loss: 5.472358E+00 | grad norm: 0.987 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4338/ 128728 | consumed samples: 69408 | consumed tokens: 142147584 | elapsed time per iteration (s): 15.22 | learning rate: 2.274E-05 | global batch size: 16 | lm loss: 5.075167E+00 | grad norm: 1.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 
4339/ 128728 | consumed samples: 69424 | consumed tokens: 142180352 | elapsed time per iteration (s): 15.20 | learning rate: 2.275E-05 | global batch size: 16 | lm loss: 5.377363E+00 | grad norm: 0.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4340/ 128728 | consumed samples: 69440 | consumed tokens: 142213120 | elapsed time per iteration (s): 15.23 | learning rate: 2.275E-05 | global batch size: 16 | lm loss: 5.143479E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4341/ 128728 | consumed samples: 69456 | consumed tokens: 142245888 | elapsed time per iteration (s): 15.22 | learning rate: 2.276E-05 | global batch size: 16 | lm loss: 5.224275E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4342/ 128728 | consumed samples: 69472 | consumed tokens: 142278656 | elapsed time per iteration (s): 15.21 | learning rate: 2.276E-05 | global batch size: 16 | lm loss: 4.925807E+00 | grad norm: 0.732 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4343/ 128728 | consumed samples: 69488 | consumed tokens: 142311424 | elapsed time per iteration (s): 15.20 | learning rate: 2.277E-05 | global batch size: 16 | lm loss: 5.460708E+00 | grad norm: 0.648 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4344/ 128728 | consumed samples: 69504 | consumed tokens: 142344192 | elapsed time per iteration (s): 15.23 | learning rate: 2.278E-05 | global batch size: 16 | lm loss: 5.483164E+00 | grad norm: 10.016 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4345/ 128728 | consumed samples: 69520 | consumed tokens: 142376960 | elapsed time per iteration (s): 15.22 | learning rate: 2.278E-05 | global batch size: 16 | lm loss: 5.302545E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4346/ 128728 | consumed samples: 69536 | consumed tokens: 142409728 | elapsed time per iteration (s): 15.22 | learning rate: 2.279E-05 | global batch size: 16 | lm loss: 5.222922E+00 | grad norm: 0.775 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4347/ 128728 | consumed samples: 69552 | consumed tokens: 142442496 | elapsed time per iteration (s): 15.22 | learning rate: 2.279E-05 | global batch size: 16 | lm loss: 5.134640E+00 | grad norm: 4.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4348/ 128728 | consumed samples: 69568 | consumed tokens: 142475264 | elapsed time per iteration (s): 15.22 | learning rate: 2.280E-05 | global batch size: 16 | lm loss: 5.309505E+00 | grad norm: 0.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4349/ 128728 | consumed samples: 69584 | consumed tokens: 142508032 | elapsed time per iteration (s): 15.26 | learning rate: 2.280E-05 | global batch size: 16 | lm loss: 5.236284E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4350/ 128728 | consumed samples: 69600 | consumed tokens: 142540800 | elapsed time per iteration (s): 15.24 | learning rate: 2.281E-05 | 
global batch size: 16 | lm loss: 5.381992E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4351/ 128728 | consumed samples: 69616 | consumed tokens: 142573568 | elapsed time per iteration (s): 15.22 | learning rate: 2.281E-05 | global batch size: 16 | lm loss: 5.128081E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4352/ 128728 | consumed samples: 69632 | consumed tokens: 142606336 | elapsed time per iteration (s): 15.22 | learning rate: 2.282E-05 | global batch size: 16 | lm loss: 5.119745E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4353/ 128728 | consumed samples: 69648 | consumed tokens: 142639104 | elapsed time per iteration (s): 15.21 | learning rate: 2.282E-05 | global batch size: 16 | lm loss: 5.373334E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4354/ 128728 | consumed samples: 69664 | consumed tokens: 142671872 | elapsed time per iteration (s): 15.24 | learning rate: 2.283E-05 | global batch size: 16 | lm loss: 5.252212E+00 | grad norm: 1.092 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4355/ 128728 | consumed samples: 69680 | consumed tokens: 142704640 | elapsed time per iteration (s): 15.23 | learning rate: 2.283E-05 | global batch size: 16 | lm loss: 5.083073E+00 | grad norm: 1.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4356/ 128728 | consumed samples: 
69696 | consumed tokens: 142737408 | elapsed time per iteration (s): 15.23 | learning rate: 2.284E-05 | global batch size: 16 | lm loss: 5.302938E+00 | grad norm: 0.723 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4357/ 128728 | consumed samples: 69712 | consumed tokens: 142770176 | elapsed time per iteration (s): 15.21 | learning rate: 2.284E-05 | global batch size: 16 | lm loss: 5.174490E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4358/ 128728 | consumed samples: 69728 | consumed tokens: 142802944 | elapsed time per iteration (s): 15.20 | learning rate: 2.285E-05 | global batch size: 16 | lm loss: 5.063147E+00 | grad norm: 1.326 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4359/ 128728 | consumed samples: 69744 | consumed tokens: 142835712 | elapsed time per iteration (s): 15.22 | learning rate: 2.285E-05 | global batch size: 16 | lm loss: 5.283265E+00 | grad norm: 0.823 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4360/ 128728 | consumed samples: 69760 | consumed tokens: 142868480 | elapsed time per iteration (s): 15.16 | learning rate: 2.286E-05 | global batch size: 16 | lm loss: 5.007393E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4361/ 128728 | consumed samples: 69776 | consumed tokens: 142901248 | elapsed time per iteration (s): 15.22 | learning rate: 2.286E-05 | global batch size: 16 | lm loss: 5.309522E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4362/ 128728 | consumed samples: 69792 | consumed tokens: 142934016 | elapsed time per iteration (s): 15.23 | learning rate: 2.287E-05 | global batch size: 16 | lm loss: 5.322911E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4363/ 128728 | consumed samples: 69808 | consumed tokens: 142966784 | elapsed time per iteration (s): 15.22 | learning rate: 2.287E-05 | global batch size: 16 | lm loss: 4.989873E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4364/ 128728 | consumed samples: 69824 | consumed tokens: 142999552 | elapsed time per iteration (s): 15.17 | learning rate: 2.288E-05 | global batch size: 16 | lm loss: 5.066822E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4365/ 128728 | consumed samples: 69840 | consumed tokens: 143032320 | elapsed time per iteration (s): 15.26 | learning rate: 2.289E-05 | global batch size: 16 | lm loss: 5.063513E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4366/ 128728 | consumed samples: 69856 | consumed tokens: 143065088 | elapsed time per iteration (s): 15.24 | learning rate: 2.289E-05 | global batch size: 16 | lm loss: 5.041920E+00 | grad norm: 1.079 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4367/ 128728 | consumed samples: 69872 | consumed tokens: 143097856 | elapsed time per iteration (s): 15.20 | learning rate: 2.290E-05 | global batch size: 16 | lm loss: 5.264553E+00 
| grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4368/ 128728 | consumed samples: 69888 | consumed tokens: 143130624 | elapsed time per iteration (s): 15.25 | learning rate: 2.290E-05 | global batch size: 16 | lm loss: 5.502565E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4369/ 128728 | consumed samples: 69904 | consumed tokens: 143163392 | elapsed time per iteration (s): 15.21 | learning rate: 2.291E-05 | global batch size: 16 | lm loss: 5.140605E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4370/ 128728 | consumed samples: 69920 | consumed tokens: 143196160 | elapsed time per iteration (s): 15.20 | learning rate: 2.291E-05 | global batch size: 16 | lm loss: 5.729661E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4371/ 128728 | consumed samples: 69936 | consumed tokens: 143228928 | elapsed time per iteration (s): 15.26 | learning rate: 2.292E-05 | global batch size: 16 | lm loss: 5.166599E+00 | grad norm: 1.097 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4372/ 128728 | consumed samples: 69952 | consumed tokens: 143261696 | elapsed time per iteration (s): 15.25 | learning rate: 2.292E-05 | global batch size: 16 | lm loss: 5.305621E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4373/ 128728 | consumed samples: 69968 | consumed tokens: 143294464 | elapsed time 
per iteration (s): 15.14 | learning rate: 2.293E-05 | global batch size: 16 | lm loss: 5.239990E+00 | grad norm: 0.841 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 | [default7]: iteration 4374/ 128728 | consumed samples: 69984 | consumed tokens: 143327232 | elapsed time per iteration (s): 15.22 | learning rate: 2.293E-05 | global batch size: 16 | lm loss: 5.457160E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4375/ 128728 | consumed samples: 70000 | consumed tokens: 143360000 | elapsed time per iteration (s): 15.22 | learning rate: 2.294E-05 | global batch size: 16 | lm loss: 5.377467E+00 | grad norm: 1.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4376/ 128728 | consumed samples: 70016 | consumed tokens: 143392768 | elapsed time per iteration (s): 15.20 | learning rate: 2.294E-05 | global batch size: 16 | lm loss: 5.137845E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4377/ 128728 | consumed samples: 70032 | consumed tokens: 143425536 | elapsed time per iteration (s): 15.25 | learning rate: 2.295E-05 | global batch size: 16 | lm loss: 5.104263E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4378/ 128728 | consumed samples: 70048 | consumed tokens: 143458304 | elapsed time per iteration (s): 15.20 | learning rate: 2.295E-05 | global batch size: 16 | lm loss: 5.039233E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | 
[default7]: iteration 4379/ 128728 | consumed samples: 70064 | consumed tokens: 143491072 | elapsed time per iteration (s): 15.22 | learning rate: 2.296E-05 | global batch size: 16 | lm loss: 5.201389E+00 | grad norm: 1.449 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4380/ 128728 | consumed samples: 70080 | consumed tokens: 143523840 | elapsed time per iteration (s): 15.27 | learning rate: 2.296E-05 | global batch size: 16 | lm loss: 5.206597E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4381/ 128728 | consumed samples: 70096 | consumed tokens: 143556608 | elapsed time per iteration (s): 15.20 | learning rate: 2.297E-05 | global batch size: 16 | lm loss: 5.222647E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4382/ 128728 | consumed samples: 70112 | consumed tokens: 143589376 | elapsed time per iteration (s): 15.21 | learning rate: 2.297E-05 | global batch size: 16 | lm loss: 4.988690E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4383/ 128728 | consumed samples: 70128 | consumed tokens: 143622144 | elapsed time per iteration (s): 15.19 | learning rate: 2.298E-05 | global batch size: 16 | lm loss: 5.317173E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4384/ 128728 | consumed samples: 70144 | consumed tokens: 143654912 | elapsed time per iteration (s): 15.25 | learning rate: 2.298E-05 | global batch size: 16 | lm loss: 5.185295E+00 | grad norm: 0.669 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4385/ 128728 | consumed samples: 70160 | consumed tokens: 143687680 | elapsed time per iteration (s): 15.17 | learning rate: 2.299E-05 | global batch size: 16 | lm loss: 5.370522E+00 | grad norm: 1.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4386/ 128728 | consumed samples: 70176 | consumed tokens: 143720448 | elapsed time per iteration (s): 15.15 | learning rate: 2.300E-05 | global batch size: 16 | lm loss: 5.361063E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4387/ 128728 | consumed samples: 70192 | consumed tokens: 143753216 | elapsed time per iteration (s): 15.26 | learning rate: 2.300E-05 | global batch size: 16 | lm loss: 5.225990E+00 | grad norm: 0.805 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4388/ 128728 | consumed samples: 70208 | consumed tokens: 143785984 | elapsed time per iteration (s): 15.26 | learning rate: 2.301E-05 | global batch size: 16 | lm loss: 5.465258E+00 | grad norm: 0.764 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4389/ 128728 | consumed samples: 70224 | consumed tokens: 143818752 | elapsed time per iteration (s): 15.20 | learning rate: 2.301E-05 | global batch size: 16 | lm loss: 5.258640E+00 | grad norm: 0.727 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4390/ 128728 | consumed samples: 70240 | consumed tokens: 143851520 | elapsed time per iteration (s): 15.26 | learning rate: 
2.302E-05 | global batch size: 16 | lm loss: 5.209820E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4391/ 128728 | consumed samples: 70256 | consumed tokens: 143884288 | elapsed time per iteration (s): 15.21 | learning rate: 2.302E-05 | global batch size: 16 | lm loss: 4.884523E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4392/ 128728 | consumed samples: 70272 | consumed tokens: 143917056 | elapsed time per iteration (s): 15.26 | learning rate: 2.303E-05 | global batch size: 16 | lm loss: 5.230456E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4393/ 128728 | consumed samples: 70288 | consumed tokens: 143949824 | elapsed time per iteration (s): 15.22 | learning rate: 2.303E-05 | global batch size: 16 | lm loss: 5.428142E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4394/ 128728 | consumed samples: 70304 | consumed tokens: 143982592 | elapsed time per iteration (s): 15.19 | learning rate: 2.304E-05 | global batch size: 16 | lm loss: 5.217700E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4395/ 128728 | consumed samples: 70320 | consumed tokens: 144015360 | elapsed time per iteration (s): 15.23 | learning rate: 2.304E-05 | global batch size: 16 | lm loss: 5.157529E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4396/ 128728 | consumed 
samples: 70336 | consumed tokens: 144048128 | elapsed time per iteration (s): 15.25 | learning rate: 2.305E-05 | global batch size: 16 | lm loss: 5.335325E+00 | grad norm: 0.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 4397/ 128728 | consumed samples: 70352 | consumed tokens: 144080896 | elapsed time per iteration (s): 15.25 | learning rate: 2.305E-05 | global batch size: 16 | lm loss: 5.159653E+00 | grad norm: 0.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4398/ 128728 | consumed samples: 70368 | consumed tokens: 144113664 | elapsed time per iteration (s): 15.24 | learning rate: 2.306E-05 | global batch size: 16 | lm loss: 5.342342E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4399/ 128728 | consumed samples: 70384 | consumed tokens: 144146432 | elapsed time per iteration (s): 15.24 | learning rate: 2.306E-05 | global batch size: 16 | lm loss: 5.175276E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4400/ 128728 | consumed samples: 70400 | consumed tokens: 144179200 | elapsed time per iteration (s): 15.23 | learning rate: 2.307E-05 | global batch size: 16 | lm loss: 5.433102E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4401/ 128728 | consumed samples: 70416 | consumed tokens: 144211968 | elapsed time per iteration (s): 15.28 | learning rate: 2.307E-05 | global batch size: 16 | lm loss: 5.209073E+00 | grad norm: 1.077 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4402/ 128728 | consumed samples: 70432 | consumed tokens: 144244736 | elapsed time per iteration (s): 15.24 | learning rate: 2.308E-05 | global batch size: 16 | lm loss: 5.068762E+00 | grad norm: 0.881 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4403/ 128728 | consumed samples: 70448 | consumed tokens: 144277504 | elapsed time per iteration (s): 15.17 | learning rate: 2.308E-05 | global batch size: 16 | lm loss: 5.463076E+00 | grad norm: 0.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4404/ 128728 | consumed samples: 70464 | consumed tokens: 144310272 | elapsed time per iteration (s): 15.24 | learning rate: 2.309E-05 | global batch size: 16 | lm loss: 4.915584E+00 | grad norm: 1.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4405/ 128728 | consumed samples: 70480 | consumed tokens: 144343040 | elapsed time per iteration (s): 15.21 | learning rate: 2.309E-05 | global batch size: 16 | lm loss: 5.192725E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4406/ 128728 | consumed samples: 70496 | consumed tokens: 144375808 | elapsed time per iteration (s): 15.22 | learning rate: 2.310E-05 | global batch size: 16 | lm loss: 5.232658E+00 | grad norm: 0.647 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4407/ 128728 | consumed samples: 70512 | consumed tokens: 144408576 | elapsed time per iteration (s): 15.23 | learning rate: 2.311E-05 | global batch size: 16 | lm 
loss: 4.972489E+00 | grad norm: 0.943 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4408/ 128728 | consumed samples: 70528 | consumed tokens: 144441344 | elapsed time per iteration (s): 15.23 | learning rate: 2.311E-05 | global batch size: 16 | lm loss: 5.359754E+00 | grad norm: 0.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4409/ 128728 | consumed samples: 70544 | consumed tokens: 144474112 | elapsed time per iteration (s): 15.23 | learning rate: 2.312E-05 | global batch size: 16 | lm loss: 5.230769E+00 | grad norm: 1.020 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4410/ 128728 | consumed samples: 70560 | consumed tokens: 144506880 | elapsed time per iteration (s): 15.21 | learning rate: 2.312E-05 | global batch size: 16 | lm loss: 5.368015E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4411/ 128728 | consumed samples: 70576 | consumed tokens: 144539648 | elapsed time per iteration (s): 15.23 | learning rate: 2.313E-05 | global batch size: 16 | lm loss: 5.025774E+00 | grad norm: 0.680 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4412/ 128728 | consumed samples: 70592 | consumed tokens: 144572416 | elapsed time per iteration (s): 15.22 | learning rate: 2.313E-05 | global batch size: 16 | lm loss: 5.240927E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4413/ 128728 | consumed samples: 70608 | consumed tokens: 
144605184 | elapsed time per iteration (s): 15.25 | learning rate: 2.314E-05 | global batch size: 16 | lm loss: 5.289531E+00 | grad norm: 1.061 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4414/ 128728 | consumed samples: 70624 | consumed tokens: 144637952 | elapsed time per iteration (s): 15.24 | learning rate: 2.314E-05 | global batch size: 16 | lm loss: 5.324119E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4415/ 128728 | consumed samples: 70640 | consumed tokens: 144670720 | elapsed time per iteration (s): 15.18 | learning rate: 2.315E-05 | global batch size: 16 | lm loss: 5.208157E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4416/ 128728 | consumed samples: 70656 | consumed tokens: 144703488 | elapsed time per iteration (s): 15.26 | learning rate: 2.315E-05 | global batch size: 16 | lm loss: 5.270568E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4417/ 128728 | consumed samples: 70672 | consumed tokens: 144736256 | elapsed time per iteration (s): 15.24 | learning rate: 2.316E-05 | global batch size: 16 | lm loss: 4.967587E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4418/ 128728 | consumed samples: 70688 | consumed tokens: 144769024 | elapsed time per iteration (s): 15.26 | learning rate: 2.316E-05 | global batch size: 16 | lm loss: 5.290422E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.049 | TFLOPs: 8.03 | [default7]: iteration 4419/ 128728 | consumed samples: 70704 | consumed tokens: 144801792 | elapsed time per iteration (s): 15.24 | learning rate: 2.317E-05 | global batch size: 16 | lm loss: 5.287684E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4420/ 128728 | consumed samples: 70720 | consumed tokens: 144834560 | elapsed time per iteration (s): 15.21 | learning rate: 2.317E-05 | global batch size: 16 | lm loss: 5.091879E+00 | grad norm: 1.074 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4421/ 128728 | consumed samples: 70736 | consumed tokens: 144867328 | elapsed time per iteration (s): 15.20 | learning rate: 2.318E-05 | global batch size: 16 | lm loss: 5.381857E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4422/ 128728 | consumed samples: 70752 | consumed tokens: 144900096 | elapsed time per iteration (s): 15.21 | learning rate: 2.318E-05 | global batch size: 16 | lm loss: 5.311369E+00 | grad norm: 0.677 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4423/ 128728 | consumed samples: 70768 | consumed tokens: 144932864 | elapsed time per iteration (s): 15.24 | learning rate: 2.319E-05 | global batch size: 16 | lm loss: 4.966698E+00 | grad norm: 0.796 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4424/ 128728 | consumed samples: 70784 | consumed tokens: 144965632 | elapsed time per iteration (s): 15.25 | learning rate: 2.319E-05 | global batch size: 16 | lm loss: 5.143426E+00 | grad norm: 0.807 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4425/ 128728 | consumed samples: 70800 | consumed tokens: 144998400 | elapsed time per iteration (s): 15.25 | learning rate: 2.320E-05 | global batch size: 16 | lm loss: 4.974629E+00 | grad norm: 1.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4426/ 128728 | consumed samples: 70816 | consumed tokens: 145031168 | elapsed time per iteration (s): 15.23 | learning rate: 2.321E-05 | global batch size: 16 | lm loss: 5.348171E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4427/ 128728 | consumed samples: 70832 | consumed tokens: 145063936 | elapsed time per iteration (s): 15.19 | learning rate: 2.321E-05 | global batch size: 16 | lm loss: 5.526504E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4428/ 128728 | consumed samples: 70848 | consumed tokens: 145096704 | elapsed time per iteration (s): 15.18 | learning rate: 2.322E-05 | global batch size: 16 | lm loss: 5.218261E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4429/ 128728 | consumed samples: 70864 | consumed tokens: 145129472 | elapsed time per iteration (s): 15.26 | learning rate: 2.322E-05 | global batch size: 16 | lm loss: 5.136736E+00 | grad norm: 0.777 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4430/ 128728 | consumed samples: 70880 | consumed tokens: 145162240 | elapsed time per iteration (s): 
15.20 | learning rate: 2.323E-05 | global batch size: 16 | lm loss: 5.037167E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4431/ 128728 | consumed samples: 70896 | consumed tokens: 145195008 | elapsed time per iteration (s): 15.24 | learning rate: 2.323E-05 | global batch size: 16 | lm loss: 5.275063E+00 | grad norm: 1.014 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4432/ 128728 | consumed samples: 70912 | consumed tokens: 145227776 | elapsed time per iteration (s): 15.23 | learning rate: 2.324E-05 | global batch size: 16 | lm loss: 5.226987E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4433/ 128728 | consumed samples: 70928 | consumed tokens: 145260544 | elapsed time per iteration (s): 15.22 | learning rate: 2.324E-05 | global batch size: 16 | lm loss: 5.212551E+00 | grad norm: 0.743 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4434/ 128728 | consumed samples: 70944 | consumed tokens: 145293312 | elapsed time per iteration (s): 15.24 | learning rate: 2.325E-05 | global batch size: 16 | lm loss: 5.190849E+00 | grad norm: 0.802 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4435/ 128728 | consumed samples: 70960 | consumed tokens: 145326080 | elapsed time per iteration (s): 15.25 | learning rate: 2.325E-05 | global batch size: 16 | lm loss: 5.322753E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 
4436/ 128728 | consumed samples: 70976 | consumed tokens: 145358848 | elapsed time per iteration (s): 15.23 | learning rate: 2.326E-05 | global batch size: 16 | lm loss: 5.214334E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4437/ 128728 | consumed samples: 70992 | consumed tokens: 145391616 | elapsed time per iteration (s): 15.20 | learning rate: 2.326E-05 | global batch size: 16 | lm loss: 5.383008E+00 | grad norm: 0.827 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4438/ 128728 | consumed samples: 71008 | consumed tokens: 145424384 | elapsed time per iteration (s): 15.27 | learning rate: 2.327E-05 | global batch size: 16 | lm loss: 5.295764E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4439/ 128728 | consumed samples: 71024 | consumed tokens: 145457152 | elapsed time per iteration (s): 15.23 | learning rate: 2.327E-05 | global batch size: 16 | lm loss: 5.206472E+00 | grad norm: 1.348 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4440/ 128728 | consumed samples: 71040 | consumed tokens: 145489920 | elapsed time per iteration (s): 15.25 | learning rate: 2.328E-05 | global batch size: 16 | lm loss: 5.287164E+00 | grad norm: 1.248 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4441/ 128728 | consumed samples: 71056 | consumed tokens: 145522688 | elapsed time per iteration (s): 15.27 | learning rate: 2.328E-05 | global batch size: 16 | lm loss: 5.455941E+00 | grad norm: 2.619 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4442/ 128728 | consumed samples: 71072 | consumed tokens: 145555456 | elapsed time per iteration (s): 15.21 | learning rate: 2.329E-05 | global batch size: 16 | lm loss: 5.322211E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4443/ 128728 | consumed samples: 71088 | consumed tokens: 145588224 | elapsed time per iteration (s): 15.27 | learning rate: 2.329E-05 | global batch size: 16 | lm loss: 4.969383E+00 | grad norm: 1.570 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4444/ 128728 | consumed samples: 71104 | consumed tokens: 145620992 | elapsed time per iteration (s): 15.18 | learning rate: 2.330E-05 | global batch size: 16 | lm loss: 5.289163E+00 | grad norm: 0.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4445/ 128728 | consumed samples: 71120 | consumed tokens: 145653760 | elapsed time per iteration (s): 15.23 | learning rate: 2.330E-05 | global batch size: 16 | lm loss: 5.375591E+00 | grad norm: 1.066 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4446/ 128728 | consumed samples: 71136 | consumed tokens: 145686528 | elapsed time per iteration (s): 15.22 | learning rate: 2.331E-05 | global batch size: 16 | lm loss: 5.250404E+00 | grad norm: 1.054 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4447/ 128728 | consumed samples: 71152 | consumed tokens: 145719296 | elapsed time per iteration (s): 15.21 | learning rate: 2.332E-05 | 
global batch size: 16 | lm loss: 5.128370E+00 | grad norm: 1.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4448/ 128728 | consumed samples: 71168 | consumed tokens: 145752064 | elapsed time per iteration (s): 15.27 | learning rate: 2.332E-05 | global batch size: 16 | lm loss: 5.044110E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4449/ 128728 | consumed samples: 71184 | consumed tokens: 145784832 | elapsed time per iteration (s): 15.20 | learning rate: 2.333E-05 | global batch size: 16 | lm loss: 5.220716E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4450/ 128728 | consumed samples: 71200 | consumed tokens: 145817600 | elapsed time per iteration (s): 15.23 | learning rate: 2.333E-05 | global batch size: 16 | lm loss: 5.262385E+00 | grad norm: 1.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4451/ 128728 | consumed samples: 71216 | consumed tokens: 145850368 | elapsed time per iteration (s): 15.30 | learning rate: 2.334E-05 | global batch size: 16 | lm loss: 5.220573E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 4452/ 128728 | consumed samples: 71232 | consumed tokens: 145883136 | elapsed time per iteration (s): 15.24 | learning rate: 2.334E-05 | global batch size: 16 | lm loss: 5.203768E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4453/ 128728 | consumed samples: 
71248 | consumed tokens: 145915904 | elapsed time per iteration (s): 15.18 | learning rate: 2.335E-05 | global batch size: 16 | lm loss: 5.269801E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4454/ 128728 | consumed samples: 71264 | consumed tokens: 145948672 | elapsed time per iteration (s): 15.20 | learning rate: 2.335E-05 | global batch size: 16 | lm loss: 5.050187E+00 | grad norm: 0.665 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4455/ 128728 | consumed samples: 71280 | consumed tokens: 145981440 | elapsed time per iteration (s): 15.20 | learning rate: 2.336E-05 | global batch size: 16 | lm loss: 5.117810E+00 | grad norm: 0.728 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4456/ 128728 | consumed samples: 71296 | consumed tokens: 146014208 | elapsed time per iteration (s): 15.23 | learning rate: 2.336E-05 | global batch size: 16 | lm loss: 5.348668E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4457/ 128728 | consumed samples: 71312 | consumed tokens: 146046976 | elapsed time per iteration (s): 15.20 | learning rate: 2.337E-05 | global batch size: 16 | lm loss: 5.208596E+00 | grad norm: 0.874 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4458/ 128728 | consumed samples: 71328 | consumed tokens: 146079744 | elapsed time per iteration (s): 15.21 | learning rate: 2.337E-05 | global batch size: 16 | lm loss: 5.182173E+00 | grad norm: 0.736 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4459/ 128728 | consumed samples: 71344 | consumed tokens: 146112512 | elapsed time per iteration (s): 15.20 | learning rate: 2.338E-05 | global batch size: 16 | lm loss: 5.112963E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4460/ 128728 | consumed samples: 71360 | consumed tokens: 146145280 | elapsed time per iteration (s): 15.18 | learning rate: 2.338E-05 | global batch size: 16 | lm loss: 5.317193E+00 | grad norm: 0.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4461/ 128728 | consumed samples: 71376 | consumed tokens: 146178048 | elapsed time per iteration (s): 15.22 | learning rate: 2.339E-05 | global batch size: 16 | lm loss: 5.396627E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4462/ 128728 | consumed samples: 71392 | consumed tokens: 146210816 | elapsed time per iteration (s): 15.24 | learning rate: 2.339E-05 | global batch size: 16 | lm loss: 5.153749E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4463/ 128728 | consumed samples: 71408 | consumed tokens: 146243584 | elapsed time per iteration (s): 15.20 | learning rate: 2.340E-05 | global batch size: 16 | lm loss: 4.955423E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4464/ 128728 | consumed samples: 71424 | consumed tokens: 146276352 | elapsed time per iteration (s): 15.20 | learning rate: 2.340E-05 | global batch size: 16 | lm loss: 5.183180E+00 
| grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4465/ 128728 | consumed samples: 71440 | consumed tokens: 146309120 | elapsed time per iteration (s): 15.23 | learning rate: 2.341E-05 | global batch size: 16 | lm loss: 5.210735E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4466/ 128728 | consumed samples: 71456 | consumed tokens: 146341888 | elapsed time per iteration (s): 15.19 | learning rate: 2.341E-05 | global batch size: 16 | lm loss: 5.146707E+00 | grad norm: 0.685 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4467/ 128728 | consumed samples: 71472 | consumed tokens: 146374656 | elapsed time per iteration (s): 15.21 | learning rate: 2.342E-05 | global batch size: 16 | lm loss: 5.419012E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4468/ 128728 | consumed samples: 71488 | consumed tokens: 146407424 | elapsed time per iteration (s): 15.23 | learning rate: 2.343E-05 | global batch size: 16 | lm loss: 4.935767E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4469/ 128728 | consumed samples: 71504 | consumed tokens: 146440192 | elapsed time per iteration (s): 15.22 | learning rate: 2.343E-05 | global batch size: 16 | lm loss: 5.208894E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4470/ 128728 | consumed samples: 71520 | consumed tokens: 146472960 | elapsed time 
per iteration (s): 15.31 | learning rate: 2.344E-05 | global batch size: 16 | lm loss: 5.157829E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.045 | TFLOPs: 8.00 | [default7]: iteration 4471/ 128728 | consumed samples: 71536 | consumed tokens: 146505728 | elapsed time per iteration (s): 15.20 | learning rate: 2.344E-05 | global batch size: 16 | lm loss: 5.193877E+00 | grad norm: 0.880 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4472/ 128728 | consumed samples: 71552 | consumed tokens: 146538496 | elapsed time per iteration (s): 15.18 | learning rate: 2.345E-05 | global batch size: 16 | lm loss: 5.028915E+00 | grad norm: 1.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4473/ 128728 | consumed samples: 71568 | consumed tokens: 146571264 | elapsed time per iteration (s): 15.20 | learning rate: 2.345E-05 | global batch size: 16 | lm loss: 5.195506E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4474/ 128728 | consumed samples: 71584 | consumed tokens: 146604032 | elapsed time per iteration (s): 15.20 | learning rate: 2.346E-05 | global batch size: 16 | lm loss: 5.137490E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4475/ 128728 | consumed samples: 71600 | consumed tokens: 146636800 | elapsed time per iteration (s): 15.21 | learning rate: 2.346E-05 | global batch size: 16 | lm loss: 5.193035E+00 | grad norm: 0.957 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | 
[default7]: iteration 4476/ 128728 | consumed samples: 71616 | consumed tokens: 146669568 | elapsed time per iteration (s): 15.22 | learning rate: 2.347E-05 | global batch size: 16 | lm loss: 5.103290E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4477/ 128728 | consumed samples: 71632 | consumed tokens: 146702336 | elapsed time per iteration (s): 15.28 | learning rate: 2.347E-05 | global batch size: 16 | lm loss: 5.285797E+00 | grad norm: 1.021 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4478/ 128728 | consumed samples: 71648 | consumed tokens: 146735104 | elapsed time per iteration (s): 15.21 | learning rate: 2.348E-05 | global batch size: 16 | lm loss: 5.268842E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4479/ 128728 | consumed samples: 71664 | consumed tokens: 146767872 | elapsed time per iteration (s): 15.21 | learning rate: 2.348E-05 | global batch size: 16 | lm loss: 5.037316E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4480/ 128728 | consumed samples: 71680 | consumed tokens: 146800640 | elapsed time per iteration (s): 15.23 | learning rate: 2.349E-05 | global batch size: 16 | lm loss: 5.297882E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4481/ 128728 | consumed samples: 71696 | consumed tokens: 146833408 | elapsed time per iteration (s): 15.21 | learning rate: 2.349E-05 | global batch size: 16 | lm loss: 5.265193E+00 | grad norm: 0.752 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4482/ 128728 | consumed samples: 71712 | consumed tokens: 146866176 | elapsed time per iteration (s): 15.23 | learning rate: 2.350E-05 | global batch size: 16 | lm loss: 5.264235E+00 | grad norm: 0.691 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4483/ 128728 | consumed samples: 71728 | consumed tokens: 146898944 | elapsed time per iteration (s): 15.22 | learning rate: 2.350E-05 | global batch size: 16 | lm loss: 5.150328E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4484/ 128728 | consumed samples: 71744 | consumed tokens: 146931712 | elapsed time per iteration (s): 15.21 | learning rate: 2.351E-05 | global batch size: 16 | lm loss: 5.206596E+00 | grad norm: 0.868 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4485/ 128728 | consumed samples: 71760 | consumed tokens: 146964480 | elapsed time per iteration (s): 15.20 | learning rate: 2.351E-05 | global batch size: 16 | lm loss: 5.173460E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4486/ 128728 | consumed samples: 71776 | consumed tokens: 146997248 | elapsed time per iteration (s): 15.23 | learning rate: 2.352E-05 | global batch size: 16 | lm loss: 5.111909E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4487/ 128728 | consumed samples: 71792 | consumed tokens: 147030016 | elapsed time per iteration (s): 15.21 | learning rate: 
2.352E-05 | global batch size: 16 | lm loss: 5.359019E+00 | grad norm: 0.664 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4488/ 128728 | consumed samples: 71808 | consumed tokens: 147062784 | elapsed time per iteration (s): 15.21 | learning rate: 2.353E-05 | global batch size: 16 | lm loss: 5.182050E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4489/ 128728 | consumed samples: 71824 | consumed tokens: 147095552 | elapsed time per iteration (s): 15.20 | learning rate: 2.354E-05 | global batch size: 16 | lm loss: 5.222503E+00 | grad norm: 2.350 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4490/ 128728 | consumed samples: 71840 | consumed tokens: 147128320 | elapsed time per iteration (s): 15.24 | learning rate: 2.354E-05 | global batch size: 16 | lm loss: 5.156126E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4491/ 128728 | consumed samples: 71856 | consumed tokens: 147161088 | elapsed time per iteration (s): 15.27 | learning rate: 2.355E-05 | global batch size: 16 | lm loss: 5.105990E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4492/ 128728 | consumed samples: 71872 | consumed tokens: 147193856 | elapsed time per iteration (s): 15.24 | learning rate: 2.355E-05 | global batch size: 16 | lm loss: 5.223593E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4493/ 128728 | consumed 
samples: 71888 | consumed tokens: 147226624 | elapsed time per iteration (s): 15.22 | learning rate: 2.356E-05 | global batch size: 16 | lm loss: 5.256061E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4494/ 128728 | consumed samples: 71904 | consumed tokens: 147259392 | elapsed time per iteration (s): 15.23 | learning rate: 2.356E-05 | global batch size: 16 | lm loss: 5.453145E+00 | grad norm: 0.739 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4495/ 128728 | consumed samples: 71920 | consumed tokens: 147292160 | elapsed time per iteration (s): 15.15 | learning rate: 2.357E-05 | global batch size: 16 | lm loss: 5.212368E+00 | grad norm: 0.979 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.09 | [default7]: iteration 4496/ 128728 | consumed samples: 71936 | consumed tokens: 147324928 | elapsed time per iteration (s): 15.21 | learning rate: 2.357E-05 | global batch size: 16 | lm loss: 4.998689E+00 | grad norm: 0.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4497/ 128728 | consumed samples: 71952 | consumed tokens: 147357696 | elapsed time per iteration (s): 15.22 | learning rate: 2.358E-05 | global batch size: 16 | lm loss: 5.206701E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4498/ 128728 | consumed samples: 71968 | consumed tokens: 147390464 | elapsed time per iteration (s): 15.25 | learning rate: 2.358E-05 | global batch size: 16 | lm loss: 5.252453E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4499/ 128728 | consumed samples: 71984 | consumed tokens: 147423232 | elapsed time per iteration (s): 15.21 | learning rate: 2.359E-05 | global batch size: 16 | lm loss: 5.190476E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4500/ 128728 | consumed samples: 72000 | consumed tokens: 147456000 | elapsed time per iteration (s): 15.22 | learning rate: 2.359E-05 | global batch size: 16 | lm loss: 5.268880E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default0]:saving checkpoint at iteration 4500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default1]:[2022-03-04 01:02:52,316] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/mp_rank_01_model_states.pt [default0]:[2022-03-04 01:02:52,625] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/mp_rank_00_model_states.pt [default1]:[2022-03-04 01:03:04,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default5]:[2022-03-04 01:03:04,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default3]:[2022-03-04 01:03:05,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default1]:[2022-03-04 01:03:05,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default7]:[2022-03-04 01:03:05,280] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default0]:[2022-03-04 01:03:05,304] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default2]:[2022-03-04 01:03:05,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default6]:[2022-03-04 01:03:05,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default7]:[2022-03-04 01:03:05,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default6]:[2022-03-04 01:03:05,491] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default0]:[2022-03-04 01:03:05,586] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default2]:[2022-03-04 01:03:05,604] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default7]:[2022-03-04 01:03:05,718] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default6]:[2022-03-04 01:03:05,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default5]:[2022-03-04 01:03:05,778] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default4]:[2022-03-04 01:03:05,865] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default2]:[2022-03-04 01:03:05,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default3]:[2022-03-04 01:03:05,873] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default3]:[2022-03-04 01:03:05,892] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default5]:[2022-03-04 01:03:05,952] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default4]:[2022-03-04 01:03:06,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default2]:[2022-03-04 01:03:06,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default0]:[2022-03-04 01:03:06,121] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default1]:[2022-03-04 01:03:06,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default0]:[2022-03-04 01:03:06,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default4]:[2022-03-04 01:03:06,384] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default0]:[2022-03-04 01:03:06,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default5]:[2022-03-04 01:03:06,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default3]:[2022-03-04 01:03:06,852] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default1]:[2022-03-04 01:03:06,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default4]:[2022-03-04 01:03:06,829] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default6]:[2022-03-04 01:03:06,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default4]:[2022-03-04 01:03:06,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default7]:[2022-03-04 01:03:07,004] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default5]:[2022-03-04 01:03:07,009] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default4]:[2022-03-04 01:03:07,039] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default6]:[2022-03-04 01:03:07,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default7]:[2022-03-04 01:03:07,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default0]:[2022-03-04 01:03:07,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default5]:[2022-03-04 01:03:07,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default2]:[2022-03-04 01:03:07,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default0]:[2022-03-04 01:03:07,412] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default4]:[2022-03-04 01:03:07,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default6]:[2022-03-04 01:03:07,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default1]:[2022-03-04 01:03:07,404] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default7]:[2022-03-04 01:03:07,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default2]:[2022-03-04 01:03:07,556] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default4]:[2022-03-04 01:03:07,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default5]:[2022-03-04 01:03:07,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default6]:[2022-03-04 01:03:07,638] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default0]:[2022-03-04 01:03:07,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default3]:[2022-03-04 01:03:07,757] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default2]:[2022-03-04 01:03:07,717] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default4]:[2022-03-04 01:03:07,827] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default5]:[2022-03-04 01:03:07,826] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default2]:[2022-03-04 01:03:07,876] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default0]:[2022-03-04 01:03:07,906] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default1]:[2022-03-04 01:03:07,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default1]:[2022-03-04 01:03:07,954] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default2]:[2022-03-04 01:03:07,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default7]:[2022-03-04 01:03:07,994] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default0]:[2022-03-04 01:03:07,980] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default3]:[2022-03-04 01:03:08,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default3]:[2022-03-04 01:03:08,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default0]:[2022-03-04 01:03:08,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default0]:[2022-03-04 01:03:08,122] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default6]:[2022-03-04 01:03:08,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default5]:[2022-03-04 01:03:08,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default1]:[2022-03-04 01:03:08,227] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default0]:[2022-03-04 01:03:08,176] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default7]:[2022-03-04 01:03:08,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default1]:[2022-03-04 01:03:08,267] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default5]:[2022-03-04 01:03:08,283] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default3]:[2022-03-04 01:03:08,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default4]:[2022-03-04 01:03:08,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default1]:[2022-03-04 01:03:08,354] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default6]:[2022-03-04 01:03:08,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default7]:[2022-03-04 01:03:08,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default1]:[2022-03-04 01:03:08,441] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default5]:[2022-03-04 01:03:08,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default6]:[2022-03-04 01:03:08,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default1]:[2022-03-04 01:03:08,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default2]:[2022-03-04 01:03:08,506] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default7]:[2022-03-04 01:03:08,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default3]:[2022-03-04 01:03:08,513] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default6]:[2022-03-04 01:03:08,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default6]:[2022-03-04 01:03:08,536] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default3]:[2022-03-04 01:03:08,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default1]:[2022-03-04 01:03:08,617] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default7]:[2022-03-04 01:03:08,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default2]:[2022-03-04 01:03:08,655] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default6]:[2022-03-04 01:03:08,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default7]:[2022-03-04 01:03:08,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default3]:[2022-03-04 01:03:08,660] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default1]:[2022-03-04 01:03:08,756] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default5]:[2022-03-04 01:03:08,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default0]:[2022-03-04 01:03:08,768] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default4]:[2022-03-04 01:03:08,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default3]:[2022-03-04 01:03:08,843] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default3]:[2022-03-04 01:03:08,853] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default2]:[2022-03-04 01:03:08,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default0]:[2022-03-04 01:03:08,959] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default2]:[2022-03-04 01:03:08,979] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default5]:[2022-03-04 01:03:08,933] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default5]:[2022-03-04 01:03:09,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default2]:[2022-03-04 01:03:09,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default4]:[2022-03-04 01:03:09,195] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default7]:[2022-03-04 01:03:09,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default4]:[2022-03-04 01:03:09,281] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default1]:[2022-03-04 01:03:09,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default4]:[2022-03-04 01:03:09,347] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default0]:[2022-03-04 01:03:09,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default0]:[2022-03-04 01:03:09,428] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default1]:[2022-03-04 01:03:09,416] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default1]:[2022-03-04 01:03:09,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default3]:[2022-03-04 01:03:09,567] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default0]:[2022-03-04 01:03:09,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default5]:[2022-03-04 01:03:09,891] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default5]:[2022-03-04 01:03:09,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default2]:[2022-03-04 01:03:09,927] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default2]:[2022-03-04 01:03:10,000] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default1]:[2022-03-04 01:03:10,133] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default5]:[2022-03-04 01:03:10,126] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default3]:[2022-03-04 01:03:10,297] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default2]:[2022-03-04 01:03:10,352] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default7]:[2022-03-04 01:03:10,319] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default4]:[2022-03-04 01:03:10,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default5]:[2022-03-04 01:03:10,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default1]:[2022-03-04 01:03:10,371] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default6]:[2022-03-04 01:03:10,476] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default5]:[2022-03-04 01:03:10,449] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default2]:[2022-03-04 01:03:10,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default6]:[2022-03-04 01:03:10,535] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default3]:[2022-03-04 01:03:10,588] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default4]:[2022-03-04 01:03:10,554] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default7]:[2022-03-04 01:03:10,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default3]:[2022-03-04 01:03:10,576] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default0]:[2022-03-04 01:03:10,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default4]:[2022-03-04 01:03:10,703] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default5]:[2022-03-04 01:03:10,784] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default3]:[2022-03-04 01:03:10,777] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default2]:[2022-03-04 01:03:10,726] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default4]:[2022-03-04 01:03:10,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default0]:[2022-03-04 01:03:10,853] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default1]:[2022-03-04 01:03:10,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default3]:[2022-03-04 01:03:10,864] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default4]:[2022-03-04 01:03:10,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default5]:[2022-03-04 01:03:10,985] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default5]:[2022-03-04 01:03:11,001] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default2]:[2022-03-04 01:03:11,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default7]:[2022-03-04 01:03:11,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default3]:[2022-03-04 01:03:11,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default4]:[2022-03-04 01:03:11,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default7]:[2022-03-04 01:03:11,192] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default1]:[2022-03-04 01:03:11,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default6]:[2022-03-04 01:03:11,218] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default6]:[2022-03-04 01:03:11,284] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default0]:[2022-03-04 01:03:11,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default2]:[2022-03-04 01:03:11,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default6]:[2022-03-04 01:03:11,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default7]:[2022-03-04 01:03:11,438] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default3]:[2022-03-04 01:03:11,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default2]:[2022-03-04 01:03:11,514] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default4]:[2022-03-04 01:03:11,527] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default4]:[2022-03-04 01:03:11,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default3]:[2022-03-04 01:03:11,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default7]:[2022-03-04 01:03:11,753] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default2]:[2022-03-04 01:03:11,764] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default6]:[2022-03-04 01:03:11,878] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default3]:[2022-03-04 01:03:11,902] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default5]:[2022-03-04 01:03:11,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default0]:[2022-03-04 01:03:11,861] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default4]:[2022-03-04 01:03:11,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default0]:[2022-03-04 01:03:11,990] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default6]:[2022-03-04 01:03:12,033] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default5]:[2022-03-04 01:03:12,017] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default7]:[2022-03-04 01:03:12,108] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default6]:[2022-03-04 01:03:12,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default0]:[2022-03-04 01:03:12,174] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default1]:[2022-03-04 01:03:12,204] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default5]:[2022-03-04 01:03:12,262] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default1]:[2022-03-04 01:03:12,186] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default5]:[2022-03-04 01:03:12,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default4]:[2022-03-04 01:03:12,427] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default1]:[2022-03-04 01:03:12,446] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default1]:[2022-03-04 01:03:12,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default6]:[2022-03-04 01:03:12,663] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default4]:[2022-03-04 01:03:12,667] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default4]:[2022-03-04 01:03:12,734] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default4]:[2022-03-04 01:03:12,847] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default7]:[2022-03-04 01:03:12,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default5]:[2022-03-04 01:03:13,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default6]:[2022-03-04 01:03:13,011] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default4]:[2022-03-04 01:03:12,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default3]:[2022-03-04 01:03:13,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default5]:[2022-03-04 01:03:13,060] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default1]:[2022-03-04 01:03:13,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default7]:[2022-03-04 01:03:13,157] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default2]:[2022-03-04 01:03:13,169] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default6]:[2022-03-04 01:03:13,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default3]:[2022-03-04 01:03:13,173] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default2]:[2022-03-04 01:03:13,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default2]:[2022-03-04 01:03:13,238] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default5]:[2022-03-04 01:03:13,407] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default4]:[2022-03-04 01:03:13,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default1]:[2022-03-04 01:03:13,388] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default7]:[2022-03-04 01:03:13,411] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default6]:[2022-03-04 01:03:13,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default7]:[2022-03-04 01:03:13,461] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default0]:[2022-03-04 01:03:13,447] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default7]:[2022-03-04 01:03:13,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default5]:[2022-03-04 01:03:13,543] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default7]:[2022-03-04 01:03:13,545] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default3]:[2022-03-04 01:03:13,542] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default2]:[2022-03-04 01:03:13,599] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default3]:[2022-03-04 01:03:13,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default4]:[2022-03-04 01:03:13,701] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default0]:[2022-03-04 01:03:13,785] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default2]:[2022-03-04 01:03:13,925] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default2]:[2022-03-04 01:03:13,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default3]:[2022-03-04 01:03:13,962] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default0]:[2022-03-04 01:03:14,007] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default4]:[2022-03-04 01:03:13,939] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default7]:[2022-03-04 01:03:14,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default6]:[2022-03-04 01:03:14,066] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default6]:[2022-03-04 01:03:14,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default0]:[2022-03-04 01:03:14,120] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default7]:[2022-03-04 01:03:14,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default1]:[2022-03-04 01:03:14,148] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default4]:[2022-03-04 01:03:14,140] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default6]:[2022-03-04 01:03:14,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default0]:[2022-03-04 01:03:14,158] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default2]:[2022-03-04 01:03:14,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default5]:[2022-03-04 01:03:14,226] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default3]:[2022-03-04 01:03:14,345] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default1]:[2022-03-04 01:03:14,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default3]:[2022-03-04 01:03:14,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default2]:[2022-03-04 01:03:14,374] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default6]:[2022-03-04 01:03:14,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default3]:[2022-03-04 01:03:14,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default2]:[2022-03-04 01:03:14,487] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default2]:[2022-03-04 01:03:14,533] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default7]:[2022-03-04 01:03:14,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default3]:[2022-03-04 01:03:14,516] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default2]:[2022-03-04 01:03:14,643] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default7]:[2022-03-04 01:03:14,699] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default6]:[2022-03-04 01:03:14,751] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default0]:[2022-03-04 01:03:14,769] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default5]:[2022-03-04 01:03:14,722] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default4]:[2022-03-04 01:03:14,911] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default3]:[2022-03-04 01:03:14,975] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default7]:[2022-03-04 01:03:14,936] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default3]:[2022-03-04 01:03:15,058] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default3]:[2022-03-04 01:03:15,307] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default7]:[2022-03-04 01:03:15,430] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default6]:[2022-03-04 01:03:15,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default3]:[2022-03-04 01:03:15,436] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default6]:[2022-03-04 01:03:15,422] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default3]:[2022-03-04 01:03:15,557] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default4]:[2022-03-04 01:03:15,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default1]:[2022-03-04 01:03:15,524] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default7]:[2022-03-04 01:03:15,574] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default5]:[2022-03-04 01:03:15,608] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default7]:[2022-03-04 01:03:15,672] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default5]:[2022-03-04 01:03:15,730] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default3]:[2022-03-04 01:03:15,691] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default1]:[2022-03-04 01:03:15,816] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default3]:[2022-03-04 01:03:15,875] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default2]:[2022-03-04 01:03:15,919] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default0]:[2022-03-04 01:03:15,945] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default5]:[2022-03-04 01:03:15,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default6]:[2022-03-04 01:03:16,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default3]:[2022-03-04 01:03:16,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default0]:[2022-03-04 01:03:16,181] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default6]:[2022-03-04 01:03:16,211] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default0]:[2022-03-04 01:03:16,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default2]:[2022-03-04 01:03:16,253] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default2]:[2022-03-04 01:03:16,310] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default6]:[2022-03-04 01:03:16,365] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default1]:[2022-03-04 01:03:16,379] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default0]:[2022-03-04 01:03:16,378] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default1]:[2022-03-04 01:03:16,389] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default7]:[2022-03-04 01:03:16,358] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default1]:[2022-03-04 01:03:16,501] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default4]:[2022-03-04 01:03:16,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default5]:[2022-03-04 01:03:16,393] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default3]:[2022-03-04 01:03:16,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default1]:[2022-03-04 01:03:16,489] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default2]:[2022-03-04 01:03:16,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default0]:[2022-03-04 01:03:16,571] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default6]:[2022-03-04 01:03:16,659] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt
[default4]:[2022-03-04 01:03:16,646] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt
[default2]:[2022-03-04 01:03:16,686] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt
[default1]:[2022-03-04 01:03:16,689] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt
[default1]:[2022-03-04 01:03:16,820] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt
[default7]:[2022-03-04 01:03:16,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt
[default6]:[2022-03-04 01:03:16,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt
[default4]:[2022-03-04 01:03:16,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt
[default6]:[2022-03-04 01:03:16,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt
[default2]:[2022-03-04 01:03:16,908] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt
[default7]:[2022-03-04 01:03:16,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt
[default2]:[2022-03-04 01:03:17,026] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt
[default1]:[2022-03-04 01:03:16,860] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt
[default7]:[2022-03-04 01:03:17,134] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt
[default0]:[2022-03-04 01:03:17,147] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt
[default0]:[2022-03-04 01:03:17,080] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt
[default5]:[2022-03-04 01:03:17,143] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt
[default7]:[2022-03-04 01:03:17,053] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt
[default7]:[2022-03-04 01:03:17,269] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt
[default7]:[2022-03-04 01:03:17,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt
[default1]:[2022-03-04 01:03:17,380] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt
[default2]:[2022-03-04 01:03:17,481] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt
[default6]:[2022-03-04 01:03:17,477] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt
[default2]:[2022-03-04 01:03:17,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt
[default0]:[2022-03-04 01:03:17,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt
[default1]:[2022-03-04 01:03:17,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt
[default3]:[2022-03-04 01:03:17,671] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt
[default3]:[2022-03-04 01:03:17,662] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt
[default6]:[2022-03-04 01:03:17,675] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt
[default0]:[2022-03-04 01:03:17,963] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt
[default0]:[2022-03-04 01:03:17,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:03:18,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt
[default6]:[2022-03-04 01:03:18,049] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt
[default5]:[2022-03-04 01:03:18,201] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt
[default5]:[2022-03-04 01:03:18,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt
[default7]:[2022-03-04 01:03:18,539] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt
[default1]:[2022-03-04 01:03:18,498] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt
[default0]:[2022-03-04 01:03:18,572] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt
[default1]:[2022-03-04 01:03:18,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt
[default5]:[2022-03-04 01:03:18,632] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt
[default7]:[2022-03-04 01:03:18,679] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt
[default0]:[2022-03-04 01:03:18,649] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt
[default6]:[2022-03-04 01:03:18,706] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt
[default4]:[2022-03-04 01:03:18,775] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt
[default1]:[2022-03-04 01:03:18,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt
[default2]:[2022-03-04 01:03:18,807] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt
[default6]:[2022-03-04 01:03:18,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt
[default4]:[2022-03-04 01:03:18,788] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt
[default3]:[2022-03-04 01:03:18,805] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt
[default4]:[2022-03-04 01:03:18,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt
[default7]:[2022-03-04 01:03:18,871] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt
[default0]:[2022-03-04 01:03:18,846] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt
[default4]:[2022-03-04 01:03:18,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt
[default5]:[2022-03-04 01:03:18,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt
[default5]:[2022-03-04 01:03:18,918] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt
[default4]:[2022-03-04 01:03:18,943] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt
[default0]:[2022-03-04 01:03:19,062] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt
[default1]:[2022-03-04 01:03:19,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt
[default2]:[2022-03-04 01:03:19,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt
[default6]:[2022-03-04 01:03:19,149] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt
[default4]:[2022-03-04 01:03:19,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt
[default2]:[2022-03-04 01:03:19,254] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt
[default7]:[2022-03-04 01:03:19,252] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt
[default3]:[2022-03-04 01:03:19,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt
[default5]:[2022-03-04 01:03:19,224] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt
[default5]:[2022-03-04 01:03:19,274] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt
[default0]:[2022-03-04 01:03:19,372] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt
[default1]:[2022-03-04 01:03:19,381] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt
[default1]:[2022-03-04 01:03:19,394] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt
[default0]:[2022-03-04 01:03:19,631] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt
[default3]:[2022-03-04 01:03:19,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt
[default4]:[2022-03-04 01:03:19,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt
[default4]:[2022-03-04 01:03:19,855] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt
[default6]:[2022-03-04 01:03:19,886] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt
[default0]:[2022-03-04 01:03:20,045] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt
[default1]:[2022-03-04 01:03:19,971] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt
[default6]:[2022-03-04 01:03:20,338] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt
[default7]:[2022-03-04 01:03:20,295] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt
[default6]:[2022-03-04 01:03:20,402] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt
[default7]:[2022-03-04 01:03:20,376] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt
[default5]:[2022-03-04 01:03:20,370] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt
[default6]:[2022-03-04 01:03:20,431] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt
[default0]:[2022-03-04 01:03:20,421] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt
[default4]:[2022-03-04 01:03:20,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt
[default7]:[2022-03-04 01:03:20,464] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt
[default5]:[2022-03-04 01:03:20,583] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt
[default3]:[2022-03-04 01:03:20,708] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt
[default2]:[2022-03-04 01:03:20,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt
[default5]:[2022-03-04 01:03:20,858] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt
[default4]:[2022-03-04 01:03:20,881] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt
[default3]:[2022-03-04 01:03:21,772] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt
[default2]:[2022-03-04 01:03:21,812] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt
[default5]:[2022-03-04 01:03:22,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt
[default7]:time (ms) | save-checkpoint: 37771.40
[default0]: successfully saved checkpoint at iteration 4500 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints
[default4]:[2022-03-04 01:03:22,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4500/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt
[default7]: iteration 4501/ 128728 | consumed samples: 72016 | consumed tokens: 147488768 | elapsed time per iteration (s): 52.98 | learning rate: 2.360E-05 | global batch size: 16 | lm loss: 5.325017E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.302 | TFLOPs: 2.31 |
[default7]: iteration 4502/ 128728 | consumed samples: 72032 | consumed tokens: 147521536 | elapsed time per iteration (s): 15.24 | learning rate:
2.360E-05 | global batch size: 16 | lm loss: 5.206753E+00 | grad norm: 0.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4503/ 128728 | consumed samples: 72048 | consumed tokens: 147554304 | elapsed time per iteration (s): 15.17 | learning rate: 2.361E-05 | global batch size: 16 | lm loss: 5.296180E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 |
[default7]: iteration 4504/ 128728 | consumed samples: 72064 | consumed tokens: 147587072 | elapsed time per iteration (s): 15.21 | learning rate: 2.361E-05 | global batch size: 16 | lm loss: 5.398469E+00 | grad norm: 1.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4505/ 128728 | consumed samples: 72080 | consumed tokens: 147619840 | elapsed time per iteration (s): 15.23 | learning rate: 2.362E-05 | global batch size: 16 | lm loss: 5.553847E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4506/ 128728 | consumed samples: 72096 | consumed tokens: 147652608 | elapsed time per iteration (s): 15.21 | learning rate: 2.362E-05 | global batch size: 16 | lm loss: 5.168607E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4507/ 128728 | consumed samples: 72112 | consumed tokens: 147685376 | elapsed time per iteration (s): 15.21 | learning rate: 2.363E-05 | global batch size: 16 | lm loss: 5.327075E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4508/ 128728 | consumed
samples: 72128 | consumed tokens: 147718144 | elapsed time per iteration (s): 15.14 | learning rate: 2.363E-05 | global batch size: 16 | lm loss: 5.287652E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration 4509/ 128728 | consumed samples: 72144 | consumed tokens: 147750912 | elapsed time per iteration (s): 15.25 | learning rate: 2.364E-05 | global batch size: 16 | lm loss: 5.220299E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration 4510/ 128728 | consumed samples: 72160 | consumed tokens: 147783680 | elapsed time per iteration (s): 15.25 | learning rate: 2.365E-05 | global batch size: 16 | lm loss: 5.033144E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 |
[default7]: iteration 4511/ 128728 | consumed samples: 72176 | consumed tokens: 147816448 | elapsed time per iteration (s): 15.22 | learning rate: 2.365E-05 | global batch size: 16 | lm loss: 5.558724E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4512/ 128728 | consumed samples: 72192 | consumed tokens: 147849216 | elapsed time per iteration (s): 15.14 | learning rate: 2.366E-05 | global batch size: 16 | lm loss: 5.105259E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.057 | TFLOPs: 8.09 |
[default7]: iteration 4513/ 128728 | consumed samples: 72208 | consumed tokens: 147881984 | elapsed time per iteration (s): 15.21 | learning rate: 2.366E-05 | global batch size: 16 | lm loss: 5.260120E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4514/ 128728 | consumed samples: 72224 | consumed tokens: 147914752 | elapsed time per iteration (s): 15.23 | learning rate: 2.367E-05 | global batch size: 16 | lm loss: 4.996598E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 |
[default7]: iteration 4515/ 128728 | consumed samples: 72240 | consumed tokens: 147947520 | elapsed time per iteration (s): 15.19 | learning rate: 2.367E-05 | global batch size: 16 | lm loss: 5.050591E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 |
[default7]: iteration 4516/ 128728 | consumed samples: 72256 | consumed tokens: 147980288 | elapsed time per iteration (s): 15.22 | learning rate: 2.368E-05 | global batch size: 16 | lm loss: 5.226483E+00 | grad norm: 1.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4517/ 128728 | consumed samples: 72272 | consumed tokens: 148013056 | elapsed time per iteration (s): 15.22 | learning rate: 2.368E-05 | global batch size: 16 | lm loss: 4.994648E+00 | grad norm: 0.960 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4518/ 128728 | consumed samples: 72288 | consumed tokens: 148045824 | elapsed time per iteration (s): 15.19 | learning rate: 2.369E-05 | global batch size: 16 | lm loss: 5.458125E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 |
[default7]: iteration 4519/ 128728 | consumed samples: 72304 | consumed tokens: 148078592 | elapsed time per iteration (s): 15.22 | learning rate: 2.369E-05 | global batch size: 16 | lm
loss: 5.218241E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4520/ 128728 | consumed samples: 72320 | consumed tokens: 148111360 | elapsed time per iteration (s): 15.21 | learning rate: 2.370E-05 | global batch size: 16 | lm loss: 5.292453E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4521/ 128728 | consumed samples: 72336 | consumed tokens: 148144128 | elapsed time per iteration (s): 15.22 | learning rate: 2.370E-05 | global batch size: 16 | lm loss: 5.189533E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4522/ 128728 | consumed samples: 72352 | consumed tokens: 148176896 | elapsed time per iteration (s): 15.21 | learning rate: 2.371E-05 | global batch size: 16 | lm loss: 5.006850E+00 | grad norm: 2.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4523/ 128728 | consumed samples: 72368 | consumed tokens: 148209664 | elapsed time per iteration (s): 15.21 | learning rate: 2.371E-05 | global batch size: 16 | lm loss: 5.264925E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 |
[default7]: iteration 4524/ 128728 | consumed samples: 72384 | consumed tokens: 148242432 | elapsed time per iteration (s): 15.22 | learning rate: 2.372E-05 | global batch size: 16 | lm loss: 5.519560E+00 | grad norm: 0.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4525/ 128728 | consumed samples: 72400 | consumed tokens:
148275200 | elapsed time per iteration (s): 15.21 | learning rate: 2.372E-05 | global batch size: 16 | lm loss: 5.181821E+00 | grad norm: 1.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 |
[default7]: iteration 4526/ 128728 | consumed samples: 72416 | consumed tokens: 148307968 | elapsed time per iteration (s): 15.22 | learning rate: 2.373E-05 | global batch size: 16 | lm loss: 5.311499E+00 | grad norm: 0.961 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4527/ 128728 | consumed samples: 72432 | consumed tokens: 148340736 | elapsed time per iteration (s): 15.18 | learning rate: 2.373E-05 | global batch size: 16 | lm loss: 5.167645E+00 | grad norm: 0.852 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 |
[default7]: iteration 4528/ 128728 | consumed samples: 72448 | consumed tokens: 148373504 | elapsed time per iteration (s): 15.25 | learning rate: 2.374E-05 | global batch size: 16 | lm loss: 5.202123E+00 | grad norm: 1.083 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 |
[default7]: iteration 4529/ 128728 | consumed samples: 72464 | consumed tokens: 148406272 | elapsed time per iteration (s): 15.22 | learning rate: 2.375E-05 | global batch size: 16 | lm loss: 5.369713E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 |
[default7]: iteration 4530/ 128728 | consumed samples: 72480 | consumed tokens: 148439040 | elapsed time per iteration (s): 15.23 | learning rate: 2.375E-05 | global batch size: 16 | lm loss: 5.040470E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second:
1.051 | TFLOPs: 8.04 | [default7]: iteration 4531/ 128728 | consumed samples: 72496 | consumed tokens: 148471808 | elapsed time per iteration (s): 15.26 | learning rate: 2.376E-05 | global batch size: 16 | lm loss: 5.086207E+00 | grad norm: 1.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4532/ 128728 | consumed samples: 72512 | consumed tokens: 148504576 | elapsed time per iteration (s): 15.22 | learning rate: 2.376E-05 | global batch size: 16 | lm loss: 5.150359E+00 | grad norm: 1.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4533/ 128728 | consumed samples: 72528 | consumed tokens: 148537344 | elapsed time per iteration (s): 15.24 | learning rate: 2.377E-05 | global batch size: 16 | lm loss: 5.247553E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4534/ 128728 | consumed samples: 72544 | consumed tokens: 148570112 | elapsed time per iteration (s): 15.23 | learning rate: 2.377E-05 | global batch size: 16 | lm loss: 5.214560E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4535/ 128728 | consumed samples: 72560 | consumed tokens: 148602880 | elapsed time per iteration (s): 15.21 | learning rate: 2.378E-05 | global batch size: 16 | lm loss: 5.090154E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4536/ 128728 | consumed samples: 72576 | consumed tokens: 148635648 | elapsed time per iteration (s): 15.23 | learning rate: 2.378E-05 | global batch size: 16 | lm loss: 4.961235E+00 | grad norm: 0.830 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4537/ 128728 | consumed samples: 72592 | consumed tokens: 148668416 | elapsed time per iteration (s): 15.20 | learning rate: 2.379E-05 | global batch size: 16 | lm loss: 5.200741E+00 | grad norm: 0.791 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4538/ 128728 | consumed samples: 72608 | consumed tokens: 148701184 | elapsed time per iteration (s): 15.25 | learning rate: 2.379E-05 | global batch size: 16 | lm loss: 5.063721E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.04 | [default7]: iteration 4539/ 128728 | consumed samples: 72624 | consumed tokens: 148733952 | elapsed time per iteration (s): 15.21 | learning rate: 2.380E-05 | global batch size: 16 | lm loss: 5.377962E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4540/ 128728 | consumed samples: 72640 | consumed tokens: 148766720 | elapsed time per iteration (s): 15.21 | learning rate: 2.380E-05 | global batch size: 16 | lm loss: 5.393027E+00 | grad norm: 0.898 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4541/ 128728 | consumed samples: 72656 | consumed tokens: 148799488 | elapsed time per iteration (s): 15.19 | learning rate: 2.381E-05 | global batch size: 16 | lm loss: 5.115465E+00 | grad norm: 0.701 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4542/ 128728 | consumed samples: 72672 | consumed tokens: 148832256 | elapsed time per iteration (s): 
15.23 | learning rate: 2.381E-05 | global batch size: 16 | lm loss: 5.172780E+00 | grad norm: 1.032 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4543/ 128728 | consumed samples: 72688 | consumed tokens: 148865024 | elapsed time per iteration (s): 15.21 | learning rate: 2.382E-05 | global batch size: 16 | lm loss: 5.387748E+00 | grad norm: 0.885 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4544/ 128728 | consumed samples: 72704 | consumed tokens: 148897792 | elapsed time per iteration (s): 15.21 | learning rate: 2.382E-05 | global batch size: 16 | lm loss: 5.250667E+00 | grad norm: 1.622 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4545/ 128728 | consumed samples: 72720 | consumed tokens: 148930560 | elapsed time per iteration (s): 15.22 | learning rate: 2.383E-05 | global batch size: 16 | lm loss: 5.358253E+00 | grad norm: 0.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4546/ 128728 | consumed samples: 72736 | consumed tokens: 148963328 | elapsed time per iteration (s): 15.21 | learning rate: 2.383E-05 | global batch size: 16 | lm loss: 5.096012E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4547/ 128728 | consumed samples: 72752 | consumed tokens: 148996096 | elapsed time per iteration (s): 15.20 | learning rate: 2.384E-05 | global batch size: 16 | lm loss: 4.942961E+00 | grad norm: 1.755 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 
4548/ 128728 | consumed samples: 72768 | consumed tokens: 149028864 | elapsed time per iteration (s): 15.21 | learning rate: 2.384E-05 | global batch size: 16 | lm loss: 5.277761E+00 | grad norm: 0.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4549/ 128728 | consumed samples: 72784 | consumed tokens: 149061632 | elapsed time per iteration (s): 15.22 | learning rate: 2.385E-05 | global batch size: 16 | lm loss: 5.401462E+00 | grad norm: 1.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4550/ 128728 | consumed samples: 72800 | consumed tokens: 149094400 | elapsed time per iteration (s): 15.25 | learning rate: 2.386E-05 | global batch size: 16 | lm loss: 5.125511E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4551/ 128728 | consumed samples: 72816 | consumed tokens: 149127168 | elapsed time per iteration (s): 15.23 | learning rate: 2.386E-05 | global batch size: 16 | lm loss: 5.149467E+00 | grad norm: 1.018 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4552/ 128728 | consumed samples: 72832 | consumed tokens: 149159936 | elapsed time per iteration (s): 15.22 | learning rate: 2.387E-05 | global batch size: 16 | lm loss: 5.229480E+00 | grad norm: 0.963 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4553/ 128728 | consumed samples: 72848 | consumed tokens: 149192704 | elapsed time per iteration (s): 15.20 | learning rate: 2.387E-05 | global batch size: 16 | lm loss: 5.411103E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4554/ 128728 | consumed samples: 72864 | consumed tokens: 149225472 | elapsed time per iteration (s): 15.20 | learning rate: 2.388E-05 | global batch size: 16 | lm loss: 5.420312E+00 | grad norm: 1.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4555/ 128728 | consumed samples: 72880 | consumed tokens: 149258240 | elapsed time per iteration (s): 15.21 | learning rate: 2.388E-05 | global batch size: 16 | lm loss: 5.258182E+00 | grad norm: 0.650 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4556/ 128728 | consumed samples: 72896 | consumed tokens: 149291008 | elapsed time per iteration (s): 15.20 | learning rate: 2.389E-05 | global batch size: 16 | lm loss: 5.368918E+00 | grad norm: 1.053 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4557/ 128728 | consumed samples: 72912 | consumed tokens: 149323776 | elapsed time per iteration (s): 15.21 | learning rate: 2.389E-05 | global batch size: 16 | lm loss: 5.145999E+00 | grad norm: 0.753 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4558/ 128728 | consumed samples: 72928 | consumed tokens: 149356544 | elapsed time per iteration (s): 15.19 | learning rate: 2.390E-05 | global batch size: 16 | lm loss: 5.343250E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4559/ 128728 | consumed samples: 72944 | consumed tokens: 149389312 | elapsed time per iteration (s): 15.20 | learning rate: 2.390E-05 | 
global batch size: 16 | lm loss: 5.249984E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4560/ 128728 | consumed samples: 72960 | consumed tokens: 149422080 | elapsed time per iteration (s): 15.23 | learning rate: 2.391E-05 | global batch size: 16 | lm loss: 5.127768E+00 | grad norm: 0.637 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4561/ 128728 | consumed samples: 72976 | consumed tokens: 149454848 | elapsed time per iteration (s): 15.20 | learning rate: 2.391E-05 | global batch size: 16 | lm loss: 5.086662E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4562/ 128728 | consumed samples: 72992 | consumed tokens: 149487616 | elapsed time per iteration (s): 15.20 | learning rate: 2.392E-05 | global batch size: 16 | lm loss: 5.438632E+00 | grad norm: 0.731 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4563/ 128728 | consumed samples: 73008 | consumed tokens: 149520384 | elapsed time per iteration (s): 15.22 | learning rate: 2.392E-05 | global batch size: 16 | lm loss: 5.137195E+00 | grad norm: 0.921 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4564/ 128728 | consumed samples: 73024 | consumed tokens: 149553152 | elapsed time per iteration (s): 15.17 | learning rate: 2.393E-05 | global batch size: 16 | lm loss: 5.080501E+00 | grad norm: 0.845 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4565/ 128728 | consumed samples: 
73040 | consumed tokens: 149585920 | elapsed time per iteration (s): 15.21 | learning rate: 2.393E-05 | global batch size: 16 | lm loss: 5.107949E+00 | grad norm: 1.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4566/ 128728 | consumed samples: 73056 | consumed tokens: 149618688 | elapsed time per iteration (s): 15.22 | learning rate: 2.394E-05 | global batch size: 16 | lm loss: 5.110487E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4567/ 128728 | consumed samples: 73072 | consumed tokens: 149651456 | elapsed time per iteration (s): 15.22 | learning rate: 2.394E-05 | global batch size: 16 | lm loss: 5.108166E+00 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4568/ 128728 | consumed samples: 73088 | consumed tokens: 149684224 | elapsed time per iteration (s): 15.25 | learning rate: 2.395E-05 | global batch size: 16 | lm loss: 5.194795E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4569/ 128728 | consumed samples: 73104 | consumed tokens: 149716992 | elapsed time per iteration (s): 15.22 | learning rate: 2.395E-05 | global batch size: 16 | lm loss: 5.168123E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4570/ 128728 | consumed samples: 73120 | consumed tokens: 149749760 | elapsed time per iteration (s): 15.20 | learning rate: 2.396E-05 | global batch size: 16 | lm loss: 5.355202E+00 | grad norm: 0.956 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4571/ 128728 | consumed samples: 73136 | consumed tokens: 149782528 | elapsed time per iteration (s): 15.25 | learning rate: 2.397E-05 | global batch size: 16 | lm loss: 5.210347E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4572/ 128728 | consumed samples: 73152 | consumed tokens: 149815296 | elapsed time per iteration (s): 15.17 | learning rate: 2.397E-05 | global batch size: 16 | lm loss: 5.141915E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.07 | [default7]: iteration 4573/ 128728 | consumed samples: 73168 | consumed tokens: 149848064 | elapsed time per iteration (s): 15.19 | learning rate: 2.398E-05 | global batch size: 16 | lm loss: 5.015357E+00 | grad norm: 0.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4574/ 128728 | consumed samples: 73184 | consumed tokens: 149880832 | elapsed time per iteration (s): 15.24 | learning rate: 2.398E-05 | global batch size: 16 | lm loss: 5.284767E+00 | grad norm: 1.642 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4575/ 128728 | consumed samples: 73200 | consumed tokens: 149913600 | elapsed time per iteration (s): 15.25 | learning rate: 2.399E-05 | global batch size: 16 | lm loss: 5.151593E+00 | grad norm: 1.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4576/ 128728 | consumed samples: 73216 | consumed tokens: 149946368 | elapsed time per iteration (s): 15.24 | learning rate: 2.399E-05 | global batch size: 16 | lm loss: 5.201889E+00 
| grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4577/ 128728 | consumed samples: 73232 | consumed tokens: 149979136 | elapsed time per iteration (s): 15.19 | learning rate: 2.400E-05 | global batch size: 16 | lm loss: 5.358136E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4578/ 128728 | consumed samples: 73248 | consumed tokens: 150011904 | elapsed time per iteration (s): 15.21 | learning rate: 2.400E-05 | global batch size: 16 | lm loss: 5.094169E+00 | grad norm: 0.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4579/ 128728 | consumed samples: 73264 | consumed tokens: 150044672 | elapsed time per iteration (s): 15.21 | learning rate: 2.401E-05 | global batch size: 16 | lm loss: 5.261844E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4580/ 128728 | consumed samples: 73280 | consumed tokens: 150077440 | elapsed time per iteration (s): 15.24 | learning rate: 2.401E-05 | global batch size: 16 | lm loss: 5.281607E+00 | grad norm: 1.033 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4581/ 128728 | consumed samples: 73296 | consumed tokens: 150110208 | elapsed time per iteration (s): 15.23 | learning rate: 2.402E-05 | global batch size: 16 | lm loss: 5.304956E+00 | grad norm: 0.903 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4582/ 128728 | consumed samples: 73312 | consumed tokens: 150142976 | elapsed time 
per iteration (s): 15.23 | learning rate: 2.402E-05 | global batch size: 16 | lm loss: 4.882883E+00 | grad norm: 1.048 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4583/ 128728 | consumed samples: 73328 | consumed tokens: 150175744 | elapsed time per iteration (s): 15.19 | learning rate: 2.403E-05 | global batch size: 16 | lm loss: 4.978672E+00 | grad norm: 0.938 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4584/ 128728 | consumed samples: 73344 | consumed tokens: 150208512 | elapsed time per iteration (s): 15.20 | learning rate: 2.403E-05 | global batch size: 16 | lm loss: 5.311226E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4585/ 128728 | consumed samples: 73360 | consumed tokens: 150241280 | elapsed time per iteration (s): 15.18 | learning rate: 2.404E-05 | global batch size: 16 | lm loss: 5.109036E+00 | grad norm: 0.781 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4586/ 128728 | consumed samples: 73376 | consumed tokens: 150274048 | elapsed time per iteration (s): 15.19 | learning rate: 2.404E-05 | global batch size: 16 | lm loss: 5.296421E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4587/ 128728 | consumed samples: 73392 | consumed tokens: 150306816 | elapsed time per iteration (s): 15.23 | learning rate: 2.405E-05 | global batch size: 16 | lm loss: 5.218729E+00 | grad norm: 0.772 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | 
[default7]: iteration 4588/ 128728 | consumed samples: 73408 | consumed tokens: 150339584 | elapsed time per iteration (s): 15.19 | learning rate: 2.405E-05 | global batch size: 16 | lm loss: 5.307782E+00 | grad norm: 0.932 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4589/ 128728 | consumed samples: 73424 | consumed tokens: 150372352 | elapsed time per iteration (s): 15.20 | learning rate: 2.406E-05 | global batch size: 16 | lm loss: 5.305587E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4590/ 128728 | consumed samples: 73440 | consumed tokens: 150405120 | elapsed time per iteration (s): 15.21 | learning rate: 2.406E-05 | global batch size: 16 | lm loss: 5.118801E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4591/ 128728 | consumed samples: 73456 | consumed tokens: 150437888 | elapsed time per iteration (s): 15.25 | learning rate: 2.407E-05 | global batch size: 16 | lm loss: 5.188321E+00 | grad norm: 0.768 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4592/ 128728 | consumed samples: 73472 | consumed tokens: 150470656 | elapsed time per iteration (s): 15.22 | learning rate: 2.408E-05 | global batch size: 16 | lm loss: 5.110636E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4593/ 128728 | consumed samples: 73488 | consumed tokens: 150503424 | elapsed time per iteration (s): 15.23 | learning rate: 2.408E-05 | global batch size: 16 | lm loss: 5.186214E+00 | grad norm: 0.971 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4594/ 128728 | consumed samples: 73504 | consumed tokens: 150536192 | elapsed time per iteration (s): 15.16 | learning rate: 2.409E-05 | global batch size: 16 | lm loss: 5.208476E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4595/ 128728 | consumed samples: 73520 | consumed tokens: 150568960 | elapsed time per iteration (s): 15.21 | learning rate: 2.409E-05 | global batch size: 16 | lm loss: 5.445783E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4596/ 128728 | consumed samples: 73536 | consumed tokens: 150601728 | elapsed time per iteration (s): 15.24 | learning rate: 2.410E-05 | global batch size: 16 | lm loss: 5.119749E+00 | grad norm: 1.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4597/ 128728 | consumed samples: 73552 | consumed tokens: 150634496 | elapsed time per iteration (s): 15.22 | learning rate: 2.410E-05 | global batch size: 16 | lm loss: 5.283437E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4598/ 128728 | consumed samples: 73568 | consumed tokens: 150667264 | elapsed time per iteration (s): 15.20 | learning rate: 2.411E-05 | global batch size: 16 | lm loss: 5.217893E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4599/ 128728 | consumed samples: 73584 | consumed tokens: 150700032 | elapsed time per iteration (s): 15.19 | learning rate: 
2.411E-05 | global batch size: 16 | lm loss: 5.280108E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4600/ 128728 | consumed samples: 73600 | consumed tokens: 150732800 | elapsed time per iteration (s): 15.21 | learning rate: 2.412E-05 | global batch size: 16 | lm loss: 5.047978E+00 | grad norm: 0.763 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4601/ 128728 | consumed samples: 73616 | consumed tokens: 150765568 | elapsed time per iteration (s): 15.21 | learning rate: 2.412E-05 | global batch size: 16 | lm loss: 5.097135E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4602/ 128728 | consumed samples: 73632 | consumed tokens: 150798336 | elapsed time per iteration (s): 15.23 | learning rate: 2.413E-05 | global batch size: 16 | lm loss: 5.246779E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4603/ 128728 | consumed samples: 73648 | consumed tokens: 150831104 | elapsed time per iteration (s): 15.21 | learning rate: 2.413E-05 | global batch size: 16 | lm loss: 5.140010E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4604/ 128728 | consumed samples: 73664 | consumed tokens: 150863872 | elapsed time per iteration (s): 15.22 | learning rate: 2.414E-05 | global batch size: 16 | lm loss: 5.305626E+00 | grad norm: 0.838 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4605/ 128728 | consumed 
samples: 73680 | consumed tokens: 150896640 | elapsed time per iteration (s): 15.21 | learning rate: 2.414E-05 | global batch size: 16 | lm loss: 4.927595E+00 | grad norm: 0.724 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4606/ 128728 | consumed samples: 73696 | consumed tokens: 150929408 | elapsed time per iteration (s): 15.21 | learning rate: 2.415E-05 | global batch size: 16 | lm loss: 5.303552E+00 | grad norm: 0.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4607/ 128728 | consumed samples: 73712 | consumed tokens: 150962176 | elapsed time per iteration (s): 15.23 | learning rate: 2.415E-05 | global batch size: 16 | lm loss: 5.152580E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4608/ 128728 | consumed samples: 73728 | consumed tokens: 150994944 | elapsed time per iteration (s): 15.21 | learning rate: 2.416E-05 | global batch size: 16 | lm loss: 5.410992E+00 | grad norm: 0.954 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4609/ 128728 | consumed samples: 73744 | consumed tokens: 151027712 | elapsed time per iteration (s): 15.22 | learning rate: 2.416E-05 | global batch size: 16 | lm loss: 5.160129E+00 | grad norm: 0.786 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4610/ 128728 | consumed samples: 73760 | consumed tokens: 151060480 | elapsed time per iteration (s): 15.22 | learning rate: 2.417E-05 | global batch size: 16 | lm loss: 5.234522E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4611/ 128728 | consumed samples: 73776 | consumed tokens: 151093248 | elapsed time per iteration (s): 15.22 | learning rate: 2.417E-05 | global batch size: 16 | lm loss: 5.088044E+00 | grad norm: 0.638 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4612/ 128728 | consumed samples: 73792 | consumed tokens: 151126016 | elapsed time per iteration (s): 15.22 | learning rate: 2.418E-05 | global batch size: 16 | lm loss: 5.261300E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4613/ 128728 | consumed samples: 73808 | consumed tokens: 151158784 | elapsed time per iteration (s): 15.23 | learning rate: 2.419E-05 | global batch size: 16 | lm loss: 5.207508E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4614/ 128728 | consumed samples: 73824 | consumed tokens: 151191552 | elapsed time per iteration (s): 15.23 | learning rate: 2.419E-05 | global batch size: 16 | lm loss: 5.234620E+00 | grad norm: 1.571 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4615/ 128728 | consumed samples: 73840 | consumed tokens: 151224320 | elapsed time per iteration (s): 15.26 | learning rate: 2.420E-05 | global batch size: 16 | lm loss: 5.073845E+00 | grad norm: 0.915 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4616/ 128728 | consumed samples: 73856 | consumed tokens: 151257088 | elapsed time per iteration (s): 15.24 | learning rate: 2.420E-05 | global batch size: 16 | lm 
loss: 4.991200E+00 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4617/ 128728 | consumed samples: 73872 | consumed tokens: 151289856 | elapsed time per iteration (s): 15.22 | learning rate: 2.421E-05 | global batch size: 16 | lm loss: 5.139315E+00 | grad norm: 0.946 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4618/ 128728 | consumed samples: 73888 | consumed tokens: 151322624 | elapsed time per iteration (s): 15.21 | learning rate: 2.421E-05 | global batch size: 16 | lm loss: 5.159419E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4619/ 128728 | consumed samples: 73904 | consumed tokens: 151355392 | elapsed time per iteration (s): 15.20 | learning rate: 2.422E-05 | global batch size: 16 | lm loss: 5.040611E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4620/ 128728 | consumed samples: 73920 | consumed tokens: 151388160 | elapsed time per iteration (s): 15.23 | learning rate: 2.422E-05 | global batch size: 16 | lm loss: 5.300824E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4621/ 128728 | consumed samples: 73936 | consumed tokens: 151420928 | elapsed time per iteration (s): 15.22 | learning rate: 2.423E-05 | global batch size: 16 | lm loss: 5.181660E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4622/ 128728 | consumed samples: 73952 | consumed tokens: 
151453696 | elapsed time per iteration (s): 15.20 | learning rate: 2.423E-05 | global batch size: 16 | lm loss: 5.045792E+00 | grad norm: 0.930 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4623/ 128728 | consumed samples: 73968 | consumed tokens: 151486464 | elapsed time per iteration (s): 15.27 | learning rate: 2.424E-05 | global batch size: 16 | lm loss: 4.973166E+00 | grad norm: 0.913 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.02 | [default7]: iteration 4624/ 128728 | consumed samples: 73984 | consumed tokens: 151519232 | elapsed time per iteration (s): 15.28 | learning rate: 2.424E-05 | global batch size: 16 | lm loss: 5.020543E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.047 | TFLOPs: 8.02 | [default7]: iteration 4625/ 128728 | consumed samples: 74000 | consumed tokens: 151552000 | elapsed time per iteration (s): 15.23 | learning rate: 2.425E-05 | global batch size: 16 | lm loss: 5.428620E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4626/ 128728 | consumed samples: 74016 | consumed tokens: 151584768 | elapsed time per iteration (s): 15.24 | learning rate: 2.425E-05 | global batch size: 16 | lm loss: 5.210262E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4627/ 128728 | consumed samples: 74032 | consumed tokens: 151617536 | elapsed time per iteration (s): 15.23 | learning rate: 2.426E-05 | global batch size: 16 | lm loss: 5.440079E+00 | grad norm: 0.801 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.051 | TFLOPs: 8.04 | [default7]: iteration 4628/ 128728 | consumed samples: 74048 | consumed tokens: 151650304 | elapsed time per iteration (s): 15.21 | learning rate: 2.426E-05 | global batch size: 16 | lm loss: 5.092575E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4629/ 128728 | consumed samples: 74064 | consumed tokens: 151683072 | elapsed time per iteration (s): 15.19 | learning rate: 2.427E-05 | global batch size: 16 | lm loss: 5.193363E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.07 | [default7]: iteration 4630/ 128728 | consumed samples: 74080 | consumed tokens: 151715840 | elapsed time per iteration (s): 15.22 | learning rate: 2.427E-05 | global batch size: 16 | lm loss: 5.114410E+00 | grad norm: 0.795 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4631/ 128728 | consumed samples: 74096 | consumed tokens: 151748608 | elapsed time per iteration (s): 15.24 | learning rate: 2.428E-05 | global batch size: 16 | lm loss: 5.234143E+00 | grad norm: 0.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4632/ 128728 | consumed samples: 74112 | consumed tokens: 151781376 | elapsed time per iteration (s): 15.17 | learning rate: 2.429E-05 | global batch size: 16 | lm loss: 5.123685E+00 | grad norm: 0.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4633/ 128728 | consumed samples: 74128 | consumed tokens: 151814144 | elapsed time per iteration (s): 15.21 | learning rate: 2.429E-05 | global batch size: 16 | lm loss: 5.050491E+00 | grad norm: 0.815 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4634/ 128728 | consumed samples: 74144 | consumed tokens: 151846912 | elapsed time per iteration (s): 15.18 | learning rate: 2.430E-05 | global batch size: 16 | lm loss: 5.095937E+00 | grad norm: 0.738 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4635/ 128728 | consumed samples: 74160 | consumed tokens: 151879680 | elapsed time per iteration (s): 15.21 | learning rate: 2.430E-05 | global batch size: 16 | lm loss: 5.212114E+00 | grad norm: 1.000 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4636/ 128728 | consumed samples: 74176 | consumed tokens: 151912448 | elapsed time per iteration (s): 15.22 | learning rate: 2.431E-05 | global batch size: 16 | lm loss: 5.032464E+00 | grad norm: 0.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4637/ 128728 | consumed samples: 74192 | consumed tokens: 151945216 | elapsed time per iteration (s): 15.21 | learning rate: 2.431E-05 | global batch size: 16 | lm loss: 5.050924E+00 | grad norm: 0.609 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4638/ 128728 | consumed samples: 74208 | consumed tokens: 151977984 | elapsed time per iteration (s): 15.18 | learning rate: 2.432E-05 | global batch size: 16 | lm loss: 5.182425E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4639/ 128728 | consumed samples: 74224 | consumed tokens: 152010752 | elapsed time per iteration (s): 
15.22 | learning rate: 2.432E-05 | global batch size: 16 | lm loss: 4.925577E+00 | grad norm: 1.052 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4640/ 128728 | consumed samples: 74240 | consumed tokens: 152043520 | elapsed time per iteration (s): 15.24 | learning rate: 2.433E-05 | global batch size: 16 | lm loss: 5.214342E+00 | grad norm: 1.465 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4641/ 128728 | consumed samples: 74256 | consumed tokens: 152076288 | elapsed time per iteration (s): 15.22 | learning rate: 2.433E-05 | global batch size: 16 | lm loss: 5.029734E+00 | grad norm: 0.682 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4642/ 128728 | consumed samples: 74272 | consumed tokens: 152109056 | elapsed time per iteration (s): 15.20 | learning rate: 2.434E-05 | global batch size: 16 | lm loss: 5.284323E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4643/ 128728 | consumed samples: 74288 | consumed tokens: 152141824 | elapsed time per iteration (s): 15.23 | learning rate: 2.434E-05 | global batch size: 16 | lm loss: 5.124467E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4644/ 128728 | consumed samples: 74304 | consumed tokens: 152174592 | elapsed time per iteration (s): 15.25 | learning rate: 2.435E-05 | global batch size: 16 | lm loss: 5.336272E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 
4645/ 128728 | consumed samples: 74320 | consumed tokens: 152207360 | elapsed time per iteration (s): 15.23 | learning rate: 2.435E-05 | global batch size: 16 | lm loss: 5.227530E+00 | grad norm: 0.711 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4646/ 128728 | consumed samples: 74336 | consumed tokens: 152240128 | elapsed time per iteration (s): 15.20 | learning rate: 2.436E-05 | global batch size: 16 | lm loss: 5.086015E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4647/ 128728 | consumed samples: 74352 | consumed tokens: 152272896 | elapsed time per iteration (s): 15.22 | learning rate: 2.436E-05 | global batch size: 16 | lm loss: 5.259191E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4648/ 128728 | consumed samples: 74368 | consumed tokens: 152305664 | elapsed time per iteration (s): 15.21 | learning rate: 2.437E-05 | global batch size: 16 | lm loss: 5.258114E+00 | grad norm: 1.670 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4649/ 128728 | consumed samples: 74384 | consumed tokens: 152338432 | elapsed time per iteration (s): 15.24 | learning rate: 2.437E-05 | global batch size: 16 | lm loss: 4.993548E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4650/ 128728 | consumed samples: 74400 | consumed tokens: 152371200 | elapsed time per iteration (s): 15.23 | learning rate: 2.438E-05 | global batch size: 16 | lm loss: 5.435277E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4651/ 128728 | consumed samples: 74416 | consumed tokens: 152403968 | elapsed time per iteration (s): 15.17 | learning rate: 2.438E-05 | global batch size: 16 | lm loss: 5.278158E+00 | grad norm: 1.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.055 | TFLOPs: 8.08 | [default7]: iteration 4652/ 128728 | consumed samples: 74432 | consumed tokens: 152436736 | elapsed time per iteration (s): 15.24 | learning rate: 2.439E-05 | global batch size: 16 | lm loss: 5.258286E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4653/ 128728 | consumed samples: 74448 | consumed tokens: 152469504 | elapsed time per iteration (s): 15.22 | learning rate: 2.440E-05 | global batch size: 16 | lm loss: 5.106120E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4654/ 128728 | consumed samples: 74464 | consumed tokens: 152502272 | elapsed time per iteration (s): 15.21 | learning rate: 2.440E-05 | global batch size: 16 | lm loss: 5.292189E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4655/ 128728 | consumed samples: 74480 | consumed tokens: 152535040 | elapsed time per iteration (s): 15.21 | learning rate: 2.441E-05 | global batch size: 16 | lm loss: 5.328452E+00 | grad norm: 0.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4656/ 128728 | consumed samples: 74496 | consumed tokens: 152567808 | elapsed time per iteration (s): 15.23 | learning rate: 2.441E-05 | 
global batch size: 16 | lm loss: 4.942339E+00 | grad norm: 0.959 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4657/ 128728 | consumed samples: 74512 | consumed tokens: 152600576 | elapsed time per iteration (s): 15.19 | learning rate: 2.442E-05 | global batch size: 16 | lm loss: 5.283966E+00 | grad norm: 0.745 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4658/ 128728 | consumed samples: 74528 | consumed tokens: 152633344 | elapsed time per iteration (s): 15.23 | learning rate: 2.442E-05 | global batch size: 16 | lm loss: 5.224200E+00 | grad norm: 1.609 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4659/ 128728 | consumed samples: 74544 | consumed tokens: 152666112 | elapsed time per iteration (s): 15.21 | learning rate: 2.443E-05 | global batch size: 16 | lm loss: 5.286874E+00 | grad norm: 0.766 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4660/ 128728 | consumed samples: 74560 | consumed tokens: 152698880 | elapsed time per iteration (s): 15.24 | learning rate: 2.443E-05 | global batch size: 16 | lm loss: 5.279421E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4661/ 128728 | consumed samples: 74576 | consumed tokens: 152731648 | elapsed time per iteration (s): 15.22 | learning rate: 2.444E-05 | global batch size: 16 | lm loss: 5.168081E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4662/ 128728 | consumed samples: 
74592 | consumed tokens: 152764416 | elapsed time per iteration (s): 15.22 | learning rate: 2.444E-05 | global batch size: 16 | lm loss: 4.949018E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4663/ 128728 | consumed samples: 74608 | consumed tokens: 152797184 | elapsed time per iteration (s): 15.25 | learning rate: 2.445E-05 | global batch size: 16 | lm loss: 5.186502E+00 | grad norm: 4.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4664/ 128728 | consumed samples: 74624 | consumed tokens: 152829952 | elapsed time per iteration (s): 15.21 | learning rate: 2.445E-05 | global batch size: 16 | lm loss: 5.038185E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4665/ 128728 | consumed samples: 74640 | consumed tokens: 152862720 | elapsed time per iteration (s): 15.20 | learning rate: 2.446E-05 | global batch size: 16 | lm loss: 5.148849E+00 | grad norm: 1.297 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4666/ 128728 | consumed samples: 74656 | consumed tokens: 152895488 | elapsed time per iteration (s): 15.20 | learning rate: 2.446E-05 | global batch size: 16 | lm loss: 5.156718E+00 | grad norm: 0.899 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4667/ 128728 | consumed samples: 74672 | consumed tokens: 152928256 | elapsed time per iteration (s): 15.23 | learning rate: 2.447E-05 | global batch size: 16 | lm loss: 5.175311E+00 | grad norm: 0.968 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4668/ 128728 | consumed samples: 74688 | consumed tokens: 152961024 | elapsed time per iteration (s): 15.20 | learning rate: 2.447E-05 | global batch size: 16 | lm loss: 5.137317E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4669/ 128728 | consumed samples: 74704 | consumed tokens: 152993792 | elapsed time per iteration (s): 15.22 | learning rate: 2.448E-05 | global batch size: 16 | lm loss: 5.099137E+00 | grad norm: 0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4670/ 128728 | consumed samples: 74720 | consumed tokens: 153026560 | elapsed time per iteration (s): 15.24 | learning rate: 2.448E-05 | global batch size: 16 | lm loss: 5.166493E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4671/ 128728 | consumed samples: 74736 | consumed tokens: 153059328 | elapsed time per iteration (s): 15.19 | learning rate: 2.449E-05 | global batch size: 16 | lm loss: 5.057539E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4672/ 128728 | consumed samples: 74752 | consumed tokens: 153092096 | elapsed time per iteration (s): 15.22 | learning rate: 2.449E-05 | global batch size: 16 | lm loss: 5.268323E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4673/ 128728 | consumed samples: 74768 | consumed tokens: 153124864 | elapsed time per iteration (s): 15.25 | learning rate: 2.450E-05 | global batch size: 16 | lm loss: 4.986012E+00 
| grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4674/ 128728 | consumed samples: 74784 | consumed tokens: 153157632 | elapsed time per iteration (s): 15.21 | learning rate: 2.451E-05 | global batch size: 16 | lm loss: 4.991606E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4675/ 128728 | consumed samples: 74800 | consumed tokens: 153190400 | elapsed time per iteration (s): 15.22 | learning rate: 2.451E-05 | global batch size: 16 | lm loss: 5.181803E+00 | grad norm: 0.923 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4676/ 128728 | consumed samples: 74816 | consumed tokens: 153223168 | elapsed time per iteration (s): 15.19 | learning rate: 2.452E-05 | global batch size: 16 | lm loss: 5.250779E+00 | grad norm: 0.909 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4677/ 128728 | consumed samples: 74832 | consumed tokens: 153255936 | elapsed time per iteration (s): 15.22 | learning rate: 2.452E-05 | global batch size: 16 | lm loss: 5.169383E+00 | grad norm: 2.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4678/ 128728 | consumed samples: 74848 | consumed tokens: 153288704 | elapsed time per iteration (s): 15.21 | learning rate: 2.453E-05 | global batch size: 16 | lm loss: 4.975980E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4679/ 128728 | consumed samples: 74864 | consumed tokens: 153321472 | elapsed time 
per iteration (s): 15.15 | learning rate: 2.453E-05 | global batch size: 16 | lm loss: 5.177567E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.056 | TFLOPs: 8.08 | [default7]: iteration 4680/ 128728 | consumed samples: 74880 | consumed tokens: 153354240 | elapsed time per iteration (s): 15.24 | learning rate: 2.454E-05 | global batch size: 16 | lm loss: 5.231998E+00 | grad norm: 1.587 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4681/ 128728 | consumed samples: 74896 | consumed tokens: 153387008 | elapsed time per iteration (s): 15.22 | learning rate: 2.454E-05 | global batch size: 16 | lm loss: 5.044895E+00 | grad norm: 0.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4682/ 128728 | consumed samples: 74912 | consumed tokens: 153419776 | elapsed time per iteration (s): 15.22 | learning rate: 2.455E-05 | global batch size: 16 | lm loss: 5.186457E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4683/ 128728 | consumed samples: 74928 | consumed tokens: 153452544 | elapsed time per iteration (s): 15.21 | learning rate: 2.455E-05 | global batch size: 16 | lm loss: 5.240637E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.05 | [default7]: iteration 4684/ 128728 | consumed samples: 74944 | consumed tokens: 153485312 | elapsed time per iteration (s): 15.19 | learning rate: 2.456E-05 | global batch size: 16 | lm loss: 5.068531E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | 
[default7]: iteration 4685/ 128728 | consumed samples: 74960 | consumed tokens: 153518080 | elapsed time per iteration (s): 15.23 | learning rate: 2.456E-05 | global batch size: 16 | lm loss: 5.105819E+00 | grad norm: 1.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4686/ 128728 | consumed samples: 74976 | consumed tokens: 153550848 | elapsed time per iteration (s): 15.22 | learning rate: 2.457E-05 | global batch size: 16 | lm loss: 5.010415E+00 | grad norm: 0.757 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4687/ 128728 | consumed samples: 74992 | consumed tokens: 153583616 | elapsed time per iteration (s): 15.23 | learning rate: 2.457E-05 | global batch size: 16 | lm loss: 5.187891E+00 | grad norm: 0.754 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4688/ 128728 | consumed samples: 75008 | consumed tokens: 153616384 | elapsed time per iteration (s): 15.22 | learning rate: 2.458E-05 | global batch size: 16 | lm loss: 5.193148E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4689/ 128728 | consumed samples: 75024 | consumed tokens: 153649152 | elapsed time per iteration (s): 15.22 | learning rate: 2.458E-05 | global batch size: 16 | lm loss: 5.094107E+00 | grad norm: 1.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4690/ 128728 | consumed samples: 75040 | consumed tokens: 153681920 | elapsed time per iteration (s): 15.23 | learning rate: 2.459E-05 | global batch size: 16 | lm loss: 5.274774E+00 | grad norm: 0.732 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4691/ 128728 | consumed samples: 75056 | consumed tokens: 153714688 | elapsed time per iteration (s): 15.24 | learning rate: 2.459E-05 | global batch size: 16 | lm loss: 5.135740E+00 | grad norm: 4.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4692/ 128728 | consumed samples: 75072 | consumed tokens: 153747456 | elapsed time per iteration (s): 15.23 | learning rate: 2.460E-05 | global batch size: 16 | lm loss: 5.176250E+00 | grad norm: 0.887 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.04 | [default7]: iteration 4693/ 128728 | consumed samples: 75088 | consumed tokens: 153780224 | elapsed time per iteration (s): 15.24 | learning rate: 2.460E-05 | global batch size: 16 | lm loss: 5.101857E+00 | grad norm: 1.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.050 | TFLOPs: 8.04 | [default7]: iteration 4694/ 128728 | consumed samples: 75104 | consumed tokens: 153812992 | elapsed time per iteration (s): 15.26 | learning rate: 2.461E-05 | global batch size: 16 | lm loss: 5.162605E+00 | grad norm: 1.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.048 | TFLOPs: 8.03 | [default7]: iteration 4695/ 128728 | consumed samples: 75120 | consumed tokens: 153845760 | elapsed time per iteration (s): 15.18 | learning rate: 2.462E-05 | global batch size: 16 | lm loss: 5.055284E+00 | grad norm: 1.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.054 | TFLOPs: 8.07 | [default7]: iteration 4696/ 128728 | consumed samples: 75136 | consumed tokens: 153878528 | elapsed time per iteration (s): 15.26 | learning rate: 
2.462E-05 | global batch size: 16 | lm loss: 5.240335E+00 | grad norm: 1.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default7]: iteration 4697/ 128728 | consumed samples: 75152 | consumed tokens: 153911296 | elapsed time per iteration (s): 15.19 | learning rate: 2.463E-05 | global batch size: 16 | lm loss: 5.174578E+00 | grad norm: 0.762 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4698/ 128728 | consumed samples: 75168 | consumed tokens: 153944064 | elapsed time per iteration (s): 15.20 | learning rate: 2.463E-05 | global batch size: 16 | lm loss: 5.296353E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.053 | TFLOPs: 8.06 | [default7]: iteration 4699/ 128728 | consumed samples: 75184 | consumed tokens: 153976832 | elapsed time per iteration (s): 15.22 | learning rate: 2.464E-05 | global batch size: 16 | lm loss: 5.088926E+00 | grad norm: 1.058 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4700/ 128728 | consumed samples: 75200 | consumed tokens: 154009600 | elapsed time per iteration (s): 15.22 | learning rate: 2.464E-05 | global batch size: 16 | lm loss: 5.067006E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4701/ 128728 | consumed samples: 75216 | consumed tokens: 154042368 | elapsed time per iteration (s): 15.23 | learning rate: 2.465E-05 | global batch size: 16 | lm loss: 5.168745E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4702/ 128728 | consumed 
samples: 75232 | consumed tokens: 154075136 | elapsed time per iteration (s): 15.20 | learning rate: 2.465E-05 | global batch size: 16 | lm loss: 5.318768E+00 | grad norm: 2.551 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.052 | TFLOPs: 8.06 | [default7]: iteration 4703/ 128728 | consumed samples: 75248 | consumed tokens: 154107904 | elapsed time per iteration (s): 15.23 | learning rate: 2.466E-05 | global batch size: 16 | lm loss: 5.285171E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.051 | TFLOPs: 8.05 | [default7]: iteration 4704/ 128728 | consumed samples: 75264 | consumed tokens: 154140672 | elapsed time per iteration (s): 15.25 | learning rate: 2.466E-05 | global batch size: 16 | lm loss: 5.164697E+00 | grad norm: 1.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.049 | TFLOPs: 8.03 | [default0]:saving checkpoint at iteration 4704 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default1]:[2022-03-04 01:55:13,827] [INFO] [logging.py:69:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/mp_rank_01_model_states.pt [default0]:[2022-03-04 01:55:14,066] [INFO] [logging.py:69:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/mp_rank_00_model_states.pt [default2]:[2022-03-04 01:55:28,793] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_42_optim_states.pt [default4]:[2022-03-04 01:55:28,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_40_optim_states.pt [default1]:[2022-03-04 01:55:29,016] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_41_optim_states.pt [default6]:[2022-03-04 01:55:29,059] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_42_optim_states.pt [default3]:[2022-03-04 01:55:29,184] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_43_optim_states.pt [default5]:[2022-03-04 01:55:29,255] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_41_optim_states.pt [default7]:[2022-03-04 01:55:29,250] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_43_optim_states.pt [default7]:[2022-03-04 01:55:29,306] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_43_optim_states.pt [default0]:[2022-03-04 01:55:29,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_40_optim_states.pt [default0]:[2022-03-04 01:55:29,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_16_optim_states.pt [default6]:[2022-03-04 01:55:29,953] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_42_optim_states.pt [default4]:[2022-03-04 01:55:29,969] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_16_optim_states.pt [default5]:[2022-03-04 01:55:29,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_17_optim_states.pt [default1]:[2022-03-04 01:55:29,909] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_37_optim_states.pt [default2]:[2022-03-04 01:55:30,132] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_18_optim_states.pt [default5]:[2022-03-04 01:55:30,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_13_optim_states.pt [default6]:[2022-03-04 01:55:30,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_18_optim_states.pt [default4]:[2022-03-04 01:55:30,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_12_optim_states.pt [default1]:[2022-03-04 01:55:30,271] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_17_optim_states.pt [default7]:[2022-03-04 01:55:30,473] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_27_optim_states.pt [default0]:[2022-03-04 01:55:30,452] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_16_optim_states.pt [default1]:[2022-03-04 01:55:30,560] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_25_optim_states.pt [default3]:[2022-03-04 01:55:30,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_15_optim_states.pt [default7]:[2022-03-04 01:55:30,639] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_19_optim_states.pt [default4]:[2022-03-04 01:55:30,666] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_24_optim_states.pt [default7]:[2022-03-04 01:55:30,597] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_15_optim_states.pt [default0]:[2022-03-04 01:55:30,722] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_24_optim_states.pt [default3]:[2022-03-04 01:55:30,722] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_19_optim_states.pt [default4]:[2022-03-04 01:55:30,834] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_40_optim_states.pt [default6]:[2022-03-04 01:55:30,863] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_26_optim_states.pt [default1]:[2022-03-04 01:55:30,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_41_optim_states.pt [default5]:[2022-03-04 01:55:30,896] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_25_optim_states.pt [default2]:[2022-03-04 01:55:31,008] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_26_optim_states.pt [default3]:[2022-03-04 01:55:30,949] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_27_optim_states.pt [default6]:[2022-03-04 01:55:31,005] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_34_optim_states.pt [default1]:[2022-03-04 01:55:31,094] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_33_optim_states.pt [default0]:[2022-03-04 01:55:31,135] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [default2]:[2022-03-04 01:55:31,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_38_optim_states.pt [default0]:[2022-03-04 01:55:31,164] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_32_optim_states.pt [default3]:[2022-03-04 01:55:31,221] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_35_optim_states.pt [default3]:[2022-03-04 01:55:31,172] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_39_optim_states.pt [default5]:[2022-03-04 01:55:31,207] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_41_optim_states.pt [default2]:[2022-03-04 01:55:31,322] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_42_optim_states.pt [default5]:[2022-03-04 01:55:31,315] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_33_optim_states.pt [default3]:[2022-03-04 01:55:31,324] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_43_optim_states.pt [default6]:[2022-03-04 01:55:31,350] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_38_optim_states.pt [default7]:[2022-03-04 01:55:31,395] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_35_optim_states.pt [default1]:[2022-03-04 01:55:31,401] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_01_optim_states.pt [default4]:[2022-03-04 01:55:31,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_32_optim_states.pt [default4]:[2022-03-04 01:55:31,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_36_optim_states.pt [default2]:[2022-03-04 01:55:31,531] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_14_optim_states.pt [default7]:[2022-03-04 01:55:31,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_03_optim_states.pt [default0]:[2022-03-04 01:55:31,474] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_12_optim_states.pt [default2]:[2022-03-04 01:55:31,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_34_optim_states.pt [default6]:[2022-03-04 01:55:31,537] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_14_optim_states.pt [default1]:[2022-03-04 01:55:31,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_13_optim_states.pt [default2]:[2022-03-04 01:55:31,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_18_optim_states.pt [default7]:[2022-03-04 01:55:31,650] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_15_optim_states.pt [default6]:[2022-03-04 01:55:31,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_14_optim_states.pt [default0]:[2022-03-04 01:55:31,593] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_40_optim_states.pt [default6]:[2022-03-04 01:55:31,620] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_02_optim_states.pt [default0]:[2022-03-04 01:55:31,678] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_08_optim_states.pt [default0]:[2022-03-04 01:55:31,737] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_08_optim_states.pt [default1]:[2022-03-04 01:55:31,669] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_13_optim_states.pt [default5]:[2022-03-04 01:55:31,753] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_37_optim_states.pt [default0]:[2022-03-04 01:55:31,733] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_12_optim_states.pt [default4]:[2022-03-04 01:55:31,845] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_12_optim_states.pt [default5]:[2022-03-04 01:55:31,790] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_01_optim_states.pt [default4]:[2022-03-04 01:55:31,944] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt [default5]:[2022-03-04 01:55:31,937] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_13_optim_states.pt [default4]:[2022-03-04 01:55:31,907] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_40_optim_states.pt [default6]:[2022-03-04 01:55:32,006] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_22_optim_states.pt [default3]:[2022-03-04 01:55:31,946] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_43_optim_states.pt [default1]:[2022-03-04 01:55:32,032] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_09_optim_states.pt [default1]:[2022-03-04 01:55:32,093] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_17_optim_states.pt [default0]:[2022-03-04 01:55:32,090] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_20_optim_states.pt [default3]:[2022-03-04 01:55:32,109] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_03_optim_states.pt [default7]:[2022-03-04 01:55:32,088] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_39_optim_states.pt [default1]:[2022-03-04 01:55:32,142] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_09_optim_states.pt [default6]:[2022-03-04 01:55:32,155] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_38_optim_states.pt [default4]:[2022-03-04 01:55:32,167] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_36_optim_states.pt [default5]:[2022-03-04 01:55:32,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_37_optim_states.pt [default7]:[2022-03-04 01:55:32,168] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_39_optim_states.pt [default3]:[2022-03-04 01:55:32,183] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_15_optim_states.pt [default0]:[2022-03-04 01:55:32,187] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_36_optim_states.pt [default4]:[2022-03-04 01:55:32,335] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_20_optim_states.pt [default2]:[2022-03-04 01:55:32,444] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_02_optim_states.pt [default1]:[2022-03-04 01:55:32,558] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_37_optim_states.pt [default5]:[2022-03-04 01:55:32,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_09_optim_states.pt [default4]:[2022-03-04 01:55:32,592] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_08_optim_states.pt [default7]:[2022-03-04 01:55:32,612] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_23_optim_states.pt [default0]:[2022-03-04 01:55:32,661] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_36_optim_states.pt [default2]:[2022-03-04 01:55:32,679] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_14_optim_states.pt [default4]:[2022-03-04 01:55:32,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt [default2]:[2022-03-04 01:55:33,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_38_optim_states.pt [default3]:[2022-03-04 01:55:33,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_19_optim_states.pt [default3]:[2022-03-04 01:55:33,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_39_optim_states.pt [default7]:[2022-03-04 01:55:33,242] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_11_optim_states.pt [default6]:[2022-03-04 01:55:33,229] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_10_optim_states.pt [default7]:[2022-03-04 01:55:33,301] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_35_optim_states.pt [default0]:[2022-03-04 01:55:33,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_32_optim_states.pt [default1]:[2022-03-04 01:55:33,327] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_21_optim_states.pt [default2]:[2022-03-04 01:55:33,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_42_optim_states.pt [default4]:[2022-03-04 01:55:33,353] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_16_optim_states.pt [default0]:[2022-03-04 01:55:33,439] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_40_optim_states.pt [default5]:[2022-03-04 01:55:33,454] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_45_optim_states.pt [default5]:[2022-03-04 01:55:33,482] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_41_optim_states.pt [default4]:[2022-03-04 01:55:33,451] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_44_optim_states.pt [default3]:[2022-03-04 01:55:33,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_31_optim_states.pt [default2]:[2022-03-04 01:55:33,580] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_34_optim_states.pt [default2]:[2022-03-04 01:55:33,598] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_26_optim_states.pt [default5]:[2022-03-04 01:55:33,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_21_optim_states.pt [default7]:[2022-03-04 01:55:33,693] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_23_optim_states.pt [default7]:[2022-03-04 01:55:33,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_31_optim_states.pt [default6]:[2022-03-04 01:55:33,697] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_18_optim_states.pt [default7]:[2022-03-04 01:55:33,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_19_optim_states.pt [default1]:[2022-03-04 01:55:33,752] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_01_optim_states.pt [default2]:[2022-03-04 01:55:33,749] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_30_optim_states.pt [default1]:[2022-03-04 01:55:33,802] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_25_optim_states.pt [default3]:[2022-03-04 01:55:33,794] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_31_optim_states.pt [default6]:[2022-03-04 01:55:33,745] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_30_optim_states.pt [default6]:[2022-03-04 01:55:33,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_18_optim_states.pt [default4]:[2022-03-04 01:55:33,814] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_20_optim_states.pt [default3]:[2022-03-04 01:55:33,935] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_47_optim_states.pt [default4]:[2022-03-04 01:55:33,922] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_40_optim_states.pt [default0]:[2022-03-04 01:55:33,973] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_28_optim_states.pt [default6]:[2022-03-04 01:55:33,984] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_22_optim_states.pt [default1]:[2022-03-04 01:55:33,991] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_33_optim_states.pt [default1]:[2022-03-04 01:55:33,995] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_41_optim_states.pt [default6]:[2022-03-04 01:55:34,077] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_10_optim_states.pt [default3]:[2022-03-04 01:55:34,128] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_11_optim_states.pt [default5]:[2022-03-04 01:55:34,178] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_05_optim_states.pt [default2]:[2022-03-04 01:55:34,199] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_10_optim_states.pt [default3]:[2022-03-04 01:55:34,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_19_optim_states.pt [default2]:[2022-03-04 01:55:34,171] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_18_optim_states.pt [default5]:[2022-03-04 01:55:34,200] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_17_optim_states.pt [default6]:[2022-03-04 01:55:34,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_34_optim_states.pt [default2]:[2022-03-04 01:55:34,212] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_06_optim_states.pt [default2]:[2022-03-04 01:55:34,275] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_22_optim_states.pt [default5]:[2022-03-04 01:55:34,325] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_33_optim_states.pt [default2]:[2022-03-04 01:55:34,348] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_26_optim_states.pt [default4]:[2022-03-04 01:55:34,330] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt [default6]:[2022-03-04 01:55:34,300] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_46_optim_states.pt [default4]:[2022-03-04 01:55:34,356] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_04_optim_states.pt [default3]:[2022-03-04 01:55:34,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_07_optim_states.pt [default7]:[2022-03-04 01:55:34,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_11_optim_states.pt [default7]:[2022-03-04 01:55:34,414] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_11_optim_states.pt [default5]:[2022-03-04 01:55:34,418] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_29_optim_states.pt [default2]:[2022-03-04 01:55:34,375] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_10_optim_states.pt [default3]:[2022-03-04 01:55:34,435] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_27_optim_states.pt [default5]:[2022-03-04 01:55:34,500] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_01_optim_states.pt [default0]:[2022-03-04 01:55:34,478] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_44_optim_states.pt [default1]:[2022-03-04 01:55:34,485] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_33_optim_states.pt [default3]:[2022-03-04 01:55:34,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_11_optim_states.pt [default4]:[2022-03-04 01:55:34,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_16_optim_states.pt [default3]:[2022-03-04 01:55:34,605] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_15_optim_states.pt [default1]:[2022-03-04 01:55:34,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_41_optim_states.pt [default1]:[2022-03-04 01:55:34,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_29_optim_states.pt [default0]:[2022-03-04 01:55:34,571] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt [default2]:[2022-03-04 01:55:34,652] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_14_optim_states.pt [default0]:[2022-03-04 01:55:34,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_28_optim_states.pt [default6]:[2022-03-04 01:55:34,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_42_optim_states.pt [default3]:[2022-03-04 01:55:34,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_23_optim_states.pt [default2]:[2022-03-04 01:55:34,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_30_optim_states.pt [default3]:[2022-03-04 01:55:34,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_35_optim_states.pt [default1]:[2022-03-04 01:55:34,732] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_29_optim_states.pt [default6]:[2022-03-04 01:55:34,729] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_42_optim_states.pt [default1]:[2022-03-04 01:55:34,835] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_29_optim_states.pt [default5]:[2022-03-04 01:55:34,856] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_41_optim_states.pt [default2]:[2022-03-04 01:55:34,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_34_optim_states.pt [default7]:[2022-03-04 01:55:34,841] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_43_optim_states.pt [default0]:[2022-03-04 01:55:34,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_40_optim_states.pt [default5]:[2022-03-04 01:55:34,882] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt [default4]:[2022-03-04 01:55:34,917] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_28_optim_states.pt [default5]:[2022-03-04 01:55:34,921] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_17_optim_states.pt [default7]:[2022-03-04 01:55:34,848] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_43_optim_states.pt [default3]:[2022-03-04 01:55:34,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_03_optim_states.pt [default3]:[2022-03-04 01:55:34,864] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_23_optim_states.pt [default1]:[2022-03-04 01:55:34,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_17_optim_states.pt [default4]:[2022-03-04 01:55:34,965] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_32_optim_states.pt [default7]:[2022-03-04 01:55:35,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_15_optim_states.pt [default5]:[2022-03-04 01:55:35,027] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_21_optim_states.pt [default7]:[2022-03-04 01:55:35,062] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_19_optim_states.pt [default0]:[2022-03-04 01:55:35,034] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_16_optim_states.pt [default3]:[2022-03-04 01:55:35,113] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_27_optim_states.pt [default3]:[2022-03-04 01:55:35,116] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_43_optim_states.pt [default2]:[2022-03-04 01:55:35,044] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_22_optim_states.pt [default0]:[2022-03-04 01:55:35,118] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_12_optim_states.pt [default0]:[2022-03-04 01:55:35,119] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_24_optim_states.pt [default5]:[2022-03-04 01:55:35,100] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_13_optim_states.pt [default1]:[2022-03-04 01:55:35,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_01_optim_states.pt [default2]:[2022-03-04 01:55:35,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_06_optim_states.pt [default6]:[2022-03-04 01:55:35,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_14_optim_states.pt [default1]:[2022-03-04 01:55:35,196] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_13_optim_states.pt [default0]:[2022-03-04 01:55:35,299] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_32_optim_states.pt [default3]:[2022-03-04 01:55:35,278] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_23_optim_states.pt [default0]:[2022-03-04 01:55:35,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt [default2]:[2022-03-04 01:55:35,290] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_42_optim_states.pt [default7]:[2022-03-04 01:55:35,342] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_35_optim_states.pt [default2]:[2022-03-04 01:55:35,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_46_optim_states.pt [default2]:[2022-03-04 01:55:35,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_22_optim_states.pt [default4]:[2022-03-04 01:55:35,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_12_optim_states.pt [default0]:[2022-03-04 01:55:35,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_04_optim_states.pt [default1]:[2022-03-04 01:55:35,518] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_21_optim_states.pt [default0]:[2022-03-04 01:55:35,508] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_20_optim_states.pt [default5]:[2022-03-04 01:55:35,504] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_13_optim_states.pt [default1]:[2022-03-04 01:55:35,530] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_45_optim_states.pt [default4]:[2022-03-04 01:55:35,615] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_32_optim_states.pt [default2]:[2022-03-04 01:55:35,561] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_02_optim_states.pt [default6]:[2022-03-04 01:55:35,626] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_14_optim_states.pt [default6]:[2022-03-04 01:55:35,695] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_02_optim_states.pt [default3]:[2022-03-04 01:55:35,716] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_03_optim_states.pt [default5]:[2022-03-04 01:55:35,748] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_05_optim_states.pt [default1]:[2022-03-04 01:55:35,735] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_13_optim_states.pt [default4]:[2022-03-04 01:55:35,797] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_04_optim_states.pt [default4]:[2022-03-04 01:55:35,773] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_12_optim_states.pt [default1]:[2022-03-04 01:55:35,831] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_05_optim_states.pt [default3]:[2022-03-04 01:55:35,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_35_optim_states.pt [default3]:[2022-03-04 01:55:35,966] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_07_optim_states.pt [default5]:[2022-03-04 01:55:35,977] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_29_optim_states.pt [default7]:[2022-03-04 01:55:35,978] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_07_optim_states.pt [default6]:[2022-03-04 01:55:35,964] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_06_optim_states.pt [default3]:[2022-03-04 01:55:36,006] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_15_optim_states.pt [default0]:[2022-03-04 01:55:36,079] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_28_optim_states.pt [default4]:[2022-03-04 01:55:36,107] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_08_optim_states.pt [default5]:[2022-03-04 01:55:36,144] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_09_optim_states.pt [default7]:[2022-03-04 01:55:36,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_15_optim_states.pt [default0]:[2022-03-04 01:55:36,237] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_12_optim_states.pt [default5]:[2022-03-04 01:55:36,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_25_optim_states.pt [default6]:[2022-03-04 01:55:36,235] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_02_optim_states.pt [default7]:[2022-03-04 01:55:36,298] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_03_optim_states.pt [default5]:[2022-03-04 01:55:36,285] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_33_optim_states.pt [default2]:[2022-03-04 01:55:36,321] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_02_optim_states.pt [default3]:[2022-03-04 01:55:36,408] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_23_optim_states.pt [default7]:[2022-03-04 01:55:36,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_47_optim_states.pt [default4]:[2022-03-04 01:55:36,363] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_24_optim_states.pt [default7]:[2022-03-04 01:55:36,417] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_03_optim_states.pt [default6]:[2022-03-04 01:55:36,503] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_34_optim_states.pt [default4]:[2022-03-04 01:55:36,510] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_08_optim_states.pt [default4]:[2022-03-04 01:55:36,623] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_16_optim_states.pt [default2]:[2022-03-04 01:55:36,647] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_22_optim_states.pt [default2]:[2022-03-04 01:55:36,640] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_14_optim_states.pt [default5]:[2022-03-04 01:55:36,741] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_09_optim_states.pt [default2]:[2022-03-04 01:55:36,719] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_10_optim_states.pt [default4]:[2022-03-04 01:55:36,789] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_24_optim_states.pt [default6]:[2022-03-04 01:55:36,836] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_10_optim_states.pt [default5]:[2022-03-04 01:55:36,854] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_01_optim_states.pt [default5]:[2022-03-04 01:55:36,844] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_17_optim_states.pt [default6]:[2022-03-04 01:55:36,850] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_22_optim_states.pt [default5]:[2022-03-04 01:55:36,923] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_25_optim_states.pt [default7]:[2022-03-04 01:55:36,956] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_23_optim_states.pt [default2]:[2022-03-04 01:55:36,958] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_10_optim_states.pt [default7]:[2022-03-04 01:55:37,002] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_31_optim_states.pt [default0]:[2022-03-04 01:55:37,018] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_04_optim_states.pt [default5]:[2022-03-04 01:55:37,043] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_29_optim_states.pt [default3]:[2022-03-04 01:55:37,021] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_11_optim_states.pt [default2]:[2022-03-04 01:55:37,070] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_30_optim_states.pt [default4]:[2022-03-04 01:55:37,050] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_28_optim_states.pt [default7]:[2022-03-04 01:55:37,125] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_47_optim_states.pt [default6]:[2022-03-04 01:55:37,225] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_30_optim_states.pt [default7]:[2022-03-04 01:55:37,213] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_27_optim_states.pt [default2]:[2022-03-04 01:55:37,257] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_46_optim_states.pt [default4]:[2022-03-04 01:55:37,261] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt [default6]:[2022-03-04 01:55:37,311] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_46_optim_states.pt [default7]:[2022-03-04 01:55:37,426] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_07_optim_states.pt [default1]:[2022-03-04 01:55:37,595] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_05_optim_states.pt [default7]:[2022-03-04 01:55:37,704] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_47_optim_states.pt [default2]:[2022-03-04 01:55:37,668] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_34_optim_states.pt [default1]:[2022-03-04 01:55:37,692] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_17_optim_states.pt [default6]:[2022-03-04 01:55:37,869] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_26_optim_states.pt [default1]:[2022-03-04 01:55:37,883] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_25_optim_states.pt [default5]:[2022-03-04 01:55:38,074] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_45_optim_states.pt [default6]:[2022-03-04 01:55:38,063] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_06_optim_states.pt [default4]:[2022-03-04 01:55:38,038] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_20_optim_states.pt [default2]:[2022-03-04 01:55:38,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_46_optim_states.pt [default4]:[2022-03-04 01:55:38,175] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_20_optim_states.pt [default5]:[2022-03-04 01:55:38,205] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_21_optim_states.pt [default5]:[2022-03-04 01:55:38,137] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_21_optim_states.pt [default3]:[2022-03-04 01:55:38,277] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_47_optim_states.pt [default7]:[2022-03-04 01:55:38,197] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_27_optim_states.pt [default0]:[2022-03-04 01:55:38,258] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_20_optim_states.pt [default1]:[2022-03-04 01:55:38,313] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_21_optim_states.pt [default7]:[2022-03-04 01:55:38,346] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_31_optim_states.pt [default0]:[2022-03-04 01:55:38,410] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_04_optim_states.pt [default0]:[2022-03-04 01:55:38,409] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_36_optim_states.pt [default0]:[2022-03-04 01:55:38,468] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_24_optim_states.pt [default4]:[2022-03-04 01:55:38,534] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_36_optim_states.pt [default7]:[2022-03-04 01:55:38,509] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_39_optim_states.pt [default0]:[2022-03-04 01:55:38,525] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_36_optim_states.pt [default5]:[2022-03-04 01:55:38,714] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_37_optim_states.pt [default0]:[2022-03-04 01:55:38,688] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_44_optim_states.pt [default0]:[2022-03-04 01:55:38,723] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_04_optim_states.pt [default6]:[2022-03-04 01:55:38,774] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_26_optim_states.pt [default1]:[2022-03-04 01:55:38,811] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_05_optim_states.pt [default0]:[2022-03-04 01:55:38,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_16_optim_states.pt [default7]:[2022-03-04 01:55:38,767] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_07_optim_states.pt [default6]:[2022-03-04 01:55:38,930] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_10_optim_states.pt [default6]:[2022-03-04 01:55:39,028] [INFO] [engine.py:3077:_save_zero_checkpoint] bf16_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_18_optim_states.pt [default6]:[2022-03-04 01:55:39,051] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_46_optim_states.pt [default7]:[2022-03-04 01:55:39,014] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_47_optim_states.pt [default3]:[2022-03-04 01:55:39,083] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_47_optim_states.pt [default7]:[2022-03-04 01:55:39,071] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_19_optim_states.pt [default3]:[2022-03-04 01:55:39,189] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_19_optim_states.pt [default1]:[2022-03-04 01:55:39,163] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_37_optim_states.pt [default1]:[2022-03-04 01:55:39,258] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_05_optim_states.pt [default6]:[2022-03-04 01:55:39,203] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_38_optim_states.pt [default3]:[2022-03-04 01:55:39,317] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_07_optim_states.pt [default4]:[2022-03-04 01:55:39,312] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_44_optim_states.pt [default6]:[2022-03-04 01:55:39,351] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_06_optim_states.pt [default2]:[2022-03-04 01:55:39,462] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_18_optim_states.pt [default2]:[2022-03-04 01:55:39,493] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_02_optim_states.pt [default7]:[2022-03-04 01:55:39,455] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_11_optim_states.pt [default3]:[2022-03-04 01:55:39,490] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_35_optim_states.pt [default6]:[2022-03-04 01:55:39,433] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_46_optim_states.pt [default3]:[2022-03-04 01:55:39,496] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_03_optim_states.pt [default1]:[2022-03-04 01:55:39,633] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_25_optim_states.pt [default2]:[2022-03-04 01:55:39,621] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_38_optim_states.pt [default4]:[2022-03-04 01:55:39,628] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_36_optim_states.pt [default3]:[2022-03-04 01:55:39,630] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_39_optim_states.pt [default6]:[2022-03-04 01:55:39,684] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_22_optim_states.pt [default2]:[2022-03-04 01:55:39,677] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_26_optim_states.pt [default4]:[2022-03-04 01:55:39,721] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_44_optim_states.pt [default3]:[2022-03-04 01:55:39,658] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_27_optim_states.pt [default5]:[2022-03-04 01:55:39,763] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_37_optim_states.pt [default7]:[2022-03-04 01:55:39,837] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_23_optim_states.pt [default5]:[2022-03-04 01:55:39,808] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_45_optim_states.pt [default4]:[2022-03-04 01:55:39,867] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_32_optim_states.pt [default1]:[2022-03-04 01:55:39,959] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_45_optim_states.pt [default1]:[2022-03-04 01:55:39,912] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_37_optim_states.pt [default6]:[2022-03-04 01:55:40,193] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_06_optim_states.pt [default6]:[2022-03-04 01:55:40,160] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_30_optim_states.pt [default7]:[2022-03-04 01:55:40,239] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_07_optim_states.pt [default1]:[2022-03-04 01:55:40,230] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_45_optim_states.pt [default0]:[2022-03-04 01:55:40,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_44_optim_states.pt [default3]:[2022-03-04 01:55:40,390] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_31_optim_states.pt [default2]:[2022-03-04 01:55:40,360] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_46_optim_states.pt [default3]:[2022-03-04 01:55:40,341] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_47_optim_states.pt [default3]:[2022-03-04 01:55:40,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_07_optim_states.pt [default2]:[2022-03-04 01:55:40,577] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_06_optim_states.pt [default2]:[2022-03-04 01:55:40,564] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_30_optim_states.pt [default4]:[2022-03-04 01:55:40,731] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_28_optim_states.pt [default3]:[2022-03-04 01:55:40,782] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_31_optim_states.pt [default0]:[2022-03-04 01:55:40,851] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt [default1]:[2022-03-04 01:55:40,941] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_01_optim_states.pt [default6]:[2022-03-04 01:55:41,166] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_30_optim_states.pt [default5]:[2022-03-04 01:55:41,216] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_09_optim_states.pt [default5]:[2022-03-04 01:55:41,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_33_optim_states.pt [default3]:[2022-03-04 01:55:41,368] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_11_optim_states.pt [default4]:[2022-03-04 01:55:41,420] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_08_optim_states.pt [default3]:[2022-03-04 01:55:41,488] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_39_optim_states.pt [default5]:[2022-03-04 01:55:41,494] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_05_optim_states.pt [default4]:[2022-03-04 01:55:41,532] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_04_optim_states.pt [default2]:[2022-03-04 01:55:41,552] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_6_mp_rank_06_optim_states.pt [default6]:[2022-03-04 01:55:41,573] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_38_optim_states.pt [default0]:[2022-03-04 01:55:41,701] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_08_optim_states.pt [default1]:[2022-03-04 01:55:41,792] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_21_optim_states.pt [default7]:[2022-03-04 01:55:41,746] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_31_optim_states.pt [default0]:[2022-03-04 01:55:41,830] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_20_optim_states.pt [default7]:[2022-03-04 01:55:41,937] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_39_optim_states.pt [default1]:[2022-03-04 01:55:42,041] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_09_optim_states.pt [default1]:[2022-03-04 01:55:42,069] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_29_optim_states.pt [default2]:[2022-03-04 01:55:42,096] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_38_optim_states.pt [default0]:[2022-03-04 01:55:42,067] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_24_optim_states.pt [default0]:[2022-03-04 01:55:42,068] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_2_mp_rank_08_optim_states.pt [default0]:[2022-03-04 01:55:42,286] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_28_optim_states.pt [default1]:[2022-03-04 01:55:42,349] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_09_optim_states.pt [default4]:[2022-03-04 01:55:42,511] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_28_optim_states.pt [default5]:[2022-03-04 01:55:42,591] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_29_optim_states.pt [default5]:[2022-03-04 01:55:42,766] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_25_optim_states.pt [default0]:[2022-03-04 01:55:42,839] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_44_optim_states.pt [default6]:[2022-03-04 01:55:42,948] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_02_optim_states.pt [default1]:[2022-03-04 01:55:43,046] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_4_mp_rank_45_optim_states.pt [default7]:[2022-03-04 01:55:43,457] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_3_mp_rank_03_optim_states.pt [default4]:[2022-03-04 01:55:44,244] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_24_optim_states.pt [default6]:[2022-03-04 01:55:44,589] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_34_optim_states.pt [default7]:[2022-03-04 01:55:44,606] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_1_mp_rank_35_optim_states.pt [default0]:[2022-03-04 01:55:44,809] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_32_optim_states.pt [default1]:[2022-03-04 01:55:44,832] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_0_mp_rank_33_optim_states.pt [default5]:[2022-03-04 01:55:45,596] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_05_optim_states.pt [default4]:[2022-03-04 01:55:45,568] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_7_mp_rank_04_optim_states.pt [default6]:[2022-03-04 01:55:45,997] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_26_optim_states.pt [default7]:[2022-03-04 01:55:46,035] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_27_optim_states.pt [default4]:[2022-03-04 01:55:51,209] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_44_optim_states.pt [default5]:[2022-03-04 01:55:51,246] [INFO] [engine.py:3077:_save_zero_checkpoint] bfl6_zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints/global_step4704/bf16_zero_pp_rank_5_mp_rank_45_optim_states.pt [default0]: successfully saved checkpoint at iteration 4704 to /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]:[exiting program after 1190.0250424226126 minutes] datetime: 2022-03-04 01:55:51 [default7]:time (ms) | save-checkpoint: 44797.43 WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system 
being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
***************************************** [default7]:> setting tensorboard ...
[default0]:Offline mode: forcing local_files_only=True [default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 [default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:PretrainedFromHF [default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type. [default0]:using torch.bfloat16 for parameters ... [default0]:------------------------ arguments ------------------------ [default0]: abort_on_unmet_fused_kernel_constraints ......... True [default0]: accumulate_allreduce_grads_in_fp32 .............. True [default0]: adam_beta1 ...................................... 0.9 [default0]: adam_beta2 ...................................... 0.95 [default0]: adam_eps ........................................ 1e-08 [default0]: adlr_autoresume ................................. False [default0]: adlr_autoresume_interval ........................ 1000 [default0]: apply_query_key_layer_scaling ................... True [default0]: apply_residual_connection_post_layernorm ........ False [default0]: attention_dropout ............................... 0.1 [default0]: attention_softmax_in_fp32 ....................... False [default0]: bert_binary_head ................................ True [default0]: bert_load ....................................... None [default0]: bf16 ............................................ True [default0]: bias_dropout_fusion ............................. True [default0]: bias_gelu_fusion ................................ True [default0]: biencoder_projection_dim ........................ 0 [default0]: biencoder_shared_query_context_model ............ False [default0]: block_data_path ................................. None [default0]: checkpoint_activations .......................... True [default0]: checkpoint_in_cpu ............................... False [default0]: checkpoint_num_layers ........................... 
1 [default0]: clip_grad ....................................... 1.0 [default0]: codecarbon_dir .................................. None [default0]: consumed_train_samples .......................... 0 [default0]: consumed_train_tokens ........................... 0 [default0]: consumed_valid_samples .......................... 0 [default0]: contigious_checkpointing ........................ False [default0]: cpu_optimizer ................................... False [default0]: cpu_torch_adam .................................. False [default0]: curriculum_learning ............................. False [default0]: data_impl ....................................... mmap [default0]: data_parallel_size .............................. 8 [default0]: data_path ....................................... None [default0]: dataloader_type ................................. single [default0]: DDP_impl ........................................ local [default0]: decoder_seq_length .............................. None [default0]: deepscale ....................................... False [default0]: deepscale_config ................................ None [default0]: deepspeed ....................................... True [default0]: deepspeed_activation_checkpointing .............. True [default0]: deepspeed_config ................................ ./ds_config.202316.json [default0]: deepspeed_mpi ................................... False [default0]: distribute_checkpointed_activations ............. False [default0]: distributed_backend ............................. nccl [default0]: embed_layernorm ................................. True [default0]: embedding_path .................................. None [default0]: encoder_seq_length .............................. 2048 [default0]: eod_mask_loss ................................... False [default0]: eval_interval ................................... 1000 [default0]: eval_iters ...................................... 
10 [default0]: eval_only ....................................... None [default0]: evidence_data_path .............................. None [default0]: exit_duration_in_mins ........................... 5990 [default0]: exit_interval ................................... None [default0]: ffn_hidden_size ................................. 57344 [default0]: finetune ........................................ False [default0]: fp16 ............................................ False [default0]: fp16_lm_cross_entropy ........................... False [default0]: fp32_residual_connection ........................ False [default0]: gigaflos_no_embeds .............................. 0 [default0]: global_batch_size ............................... 2048 [default0]: glu_activation .................................. None [default0]: hidden_dropout .................................. 0.1 [default0]: hidden_size ..................................... 14336 [default0]: hysteresis ...................................... 2 [default0]: ict_head_size ................................... None [default0]: ict_load ........................................ None [default0]: img_dim ......................................... 224 [default0]: indexer_batch_size .............................. 128 [default0]: indexer_log_interval ............................ 1000 [default0]: init_method_std ................................. 0.0048 [default0]: init_method_xavier_uniform ...................... False [default0]: initial_loss_scale .............................. 4294967296 [default0]: kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1 [default0]: kv_channels ..................................... 128 [default0]: layernorm_epsilon ............................... 1e-05 [default0]: lazy_mpu_init ................................... None [default0]: load ............................................ 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: local_rank ...................................... None [default0]: log_batch_size_to_tensorboard ................... True [default0]: log_interval .................................... 1 [default0]: log_learning_rate_to_tensorboard ................ True [default0]: log_level ....................................... None [default0]: log_level_replica ............................... None [default0]: log_loss_scale_to_tensorboard ................... True [default0]: log_num_zeros_in_grad ........................... False [default0]: log_params_norm ................................. False [default0]: log_path ........................................ None [default0]: log_timers_to_tensorboard ....................... True [default0]: log_validation_ppl_to_tensorboard ............... True [default0]: loss_on_targets_only ............................ False [default0]: loss_scale ...................................... None [default0]: loss_scale_window ............................... 1000 [default0]: lr .............................................. 6e-05 [default0]: lr_decay_iters .................................. None [default0]: lr_decay_samples ................................ 200000000 [default0]: lr_decay_style .................................. cosine [default0]: lr_decay_tokens ................................. None [default0]: lr_warmup_fraction .............................. None [default0]: lr_warmup_iters ................................. 0 [default0]: lr_warmup_samples ............................... 183105 [default0]: make_vocab_size_divisible_by .................... 128 [default0]: mask_prob ....................................... 0.15 [default0]: masked_softmax_fusion ........................... True [default0]: max_position_embeddings ......................... 2048 [default0]: memory_centric_tiled_linear ..................... 
False [default0]: merge_file ...................................... None [default0]: micro_batch_size ................................ 2 [default0]: min_loss_scale .................................. 1.0 [default0]: min_lr .......................................... 6e-06 [default0]: mmap_warmup ..................................... False [default0]: no_load_optim ................................... None [default0]: no_load_rng ..................................... None [default0]: no_save_optim ................................... None [default0]: no_save_rng ..................................... None [default0]: num_attention_heads ............................. 112 [default0]: num_channels .................................... 3 [default0]: num_classes ..................................... 1000 [default0]: num_layers ...................................... 70 [default0]: num_layers_per_virtual_pipeline_stage ........... None [default0]: num_workers ..................................... 2 [default0]: onnx_safe ....................................... None [default0]: openai_gelu ..................................... False [default0]: optimizer ....................................... adam [default0]: override_lr_scheduler ........................... False [default0]: pad_vocab_size_to ............................... 250880 [default0]: params_dtype .................................... torch.bfloat16 [default0]: partition_activations ........................... False [default0]: patch_dim ....................................... 16 [default0]: pipeline_model_parallel_size .................... 12 [default0]: position_embedding_type ......................... PositionEmbeddingType.alibi [default0]: pp_partition_method ............................. type:transformer|embedding [default0]: profile_backward ................................ False [default0]: query_in_block_prob ............................. 0.1 [default0]: rampup_batch_size ............................... 
['16', '16', '9_765_625'] [default0]: rank ............................................ 0 [default0]: remote_device ................................... none [default0]: reset_attention_mask ............................ False [default0]: reset_position_ids .............................. False [default0]: retriever_report_topk_accuracies ................ [] [default0]: retriever_score_scaling ......................... False [default0]: retriever_seq_length ............................ 256 [default0]: reweight_loss_based_on_position_frequency ....... False [default0]: sample_rate ..................................... 1.0 [default0]: save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: save_interval ................................... 500 [default0]: scatter_gather_tensors_in_pipeline .............. True [default0]: scattered_embeddings ............................ False [default0]: seed ............................................ 42 [default0]: seq_length ...................................... 2048 [default0]: sgd_momentum .................................... 0.9 [default0]: short_seq_prob .................................. 0.1 [default0]: skip_train_iteration_range ...................... None [default0]: split ........................................... None [default0]: split_transformers .............................. False [default0]: synchronize_each_layer .......................... False [default0]: tensor_model_parallel_size ...................... 4 [default0]: tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard [default0]: tensorboard_log_interval ........................ 1 [default0]: tensorboard_queue_size .......................... 5 [default0]: test_weighted_split_names ....................... ['test'] [default0]: test_weighted_split_paths ....................... 
[['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: test_weighted_split_paths_path .................. None [default0]: test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']] [default0]: test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: tile_factor ..................................... 1 [default0]: titles_data_path ................................ None [default0]: tokenizer_name_or_path .......................... 
bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k [default0]: tokenizer_type .................................. PretrainedFromHF [default0]: train_iters ..................................... None [default0]: train_samples ................................... 220000000 [default0]: train_tokens .................................... None [default0]: train_weighted_split_names ...................... ['train'] [default0]: train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: train_weighted_split_paths_path ................. None [default0]: train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']] [default0]: train_weighted_split_weights .................... 
[['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: use_bnb_optimizer ............................... False [default0]: use_checkpoint_lr_scheduler ..................... False [default0]: use_contiguous_buffers_in_ddp ................... True [default0]: use_cpu_initialization .......................... None [default0]: use_one_sent_docs ............................... False [default0]: use_pin_memory .................................. False [default0]: valid_weighted_split_names ...................... ['valid'] [default0]: valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: valid_weighted_split_paths_path ................. None [default0]: valid_weighted_split_splits ..................... 
[['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']] [default0]: valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: virtual_pipeline_model_parallel_size ............ None [default0]: vocab_extra_ids ................................. 0 [default0]: vocab_file ...................................... None [default0]: weight_decay .................................... 0.1 [default0]: world_size ...................................... 384 [default0]: zero_allgather_bucket_size ...................... 0.0 [default0]: zero_contigious_gradients ....................... False [default0]: zero_reduce_bucket_size ......................... 0.0 [default0]: zero_reduce_scatter ............................. False [default0]: zero_stage ...................................... 0 [default0]:-------------------- end of arguments --------------------- [default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples. [default0]:> building PretrainedFromHF tokenizer ... [default0]: vocab file is un-used. loading tokenizer from pre-trained model [default0]:Offline mode: forcing local_files_only=True [default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate. 
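Note on the rampup message above ("starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples"): it describes a stepwise linear schedule. The following is a minimal sketch of such a schedule, assuming the increments are spread evenly over the rampup samples. The function name and the even-spacing rule are assumptions for illustration and are not taken from this log.

```python
def rampup_global_batch_size(consumed_samples, start=16, increment=16,
                             rampup_samples=9_765_625, target=2048):
    """Return the global batch size after `consumed_samples` samples.

    Assumption (not confirmed by the log): the (target - start) / increment
    steps are spaced evenly across `rampup_samples`.
    """
    if consumed_samples >= rampup_samples:
        return target
    # Number of batch size increases between `start` and `target`.
    num_increments = (target - start) // increment
    # Samples consumed between two consecutive increases.
    samples_per_increment = rampup_samples / num_increments
    steps = int(consumed_samples / samples_per_increment)
    return min(start + steps * increment, target)


if __name__ == "__main__":
    # Print the schedule at a few sample counts.
    for samples in (0, 100_000, 5_000_000, 9_765_625):
        print(samples, rampup_global_batch_size(samples))
```

With the values from the arguments block (`rampup_batch_size ['16', '16', '9_765_625']`, `global_batch_size 2048`) this gives 127 increments, so under this assumption the batch size would grow by 16 roughly every 76,895 samples.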
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40 [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e [default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880) [default0]:DeepSpeed general environment info: [default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch'] [default0]:torch version .................... 1.11.0+cu115 [default0]:torch cuda version ............... 11.5 [default0]:nvcc version ..................... 11.4 [default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed'] [default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates [default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5 [default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm **** [default0]:> initializing torch distributed ... 
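Note on the "padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)" line above: in this run the target is set explicitly by `pad_vocab_size_to = 250880`. The same value also falls out of rounding the tokenizer size up to a multiple of `make_vocab_size_divisible_by * tensor_model_parallel_size` (128 * 4 = 512). The sketch below checks that arithmetic. The helper name and the rounding rule are assumptions for illustration, not code taken from this run.

```python
def padded_vocab_size(orig_size, divisible_by=128, tp_size=4):
    """Round `orig_size` up to the next multiple of divisible_by * tp_size.

    Assumption: padding makes the embedding table split evenly across
    tensor-parallel ranks; the rule is illustrative, not read from the log.
    """
    multiple = divisible_by * tp_size
    # Ceiling division, then scale back up to the multiple.
    return -(-orig_size // multiple) * multiple


if __name__ == "__main__":
    orig = 250_680                      # tokenizer size reported in the log
    padded = padded_vocab_size(orig)    # values from the arguments block
    print(padded, padded - orig)        # padded size and number of dummy tokens
```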
[default0]:> initializing tensor model parallel with size 4 [default0]:> initializing pipeline model parallel with size 12 srun: Job step aborted: Waiting up to 62 seconds for job step to finish. WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252031 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252032 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289979 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251437 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263080 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252333 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251438 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251439 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252033 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251440 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289980 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254984 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289981 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251441 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254820 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289982 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88310 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254821 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251442 closing signal SIGTERM slurmstepd: error: *** STEP 202316.0 ON jean-zay-iam01 CANCELLED AT 2022-03-04T03:56:57 *** WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252334 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253385 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263081 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251443 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252335 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226151 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263082 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251444 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227877 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252336 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254985 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263083 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227878 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252337 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256749 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254986 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263084 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252193 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230320 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254822 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254987 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263085 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229012 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253386 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88311 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252194 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254988 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263086 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254823 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289983 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254989 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226152 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268918 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252034 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253387 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 263087 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289984 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254990 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88312 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252318 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227879 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253388 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226153 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289985 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252035 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254991 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252036 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226154 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 289986 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253389 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285878 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88313 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252037 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253390 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256750 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226155 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76675 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254417 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252338 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226156 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252038 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230321 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244719 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253391 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226157 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253392 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252195 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76676 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89307 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 226158 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230322 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256751 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229013 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76677 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268919 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252339 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252340 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254824 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252196 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246841 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244720 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254825 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128661 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230323 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252197 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88314 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254826 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268920 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88315 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244607 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230324 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285879 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254827 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252198 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249592 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88316 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252319 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209773 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230325 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285880 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 88317 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252199 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268921 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246157 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268922 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285881 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76678 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254418 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230326 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256752 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252200 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89308 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285882 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230327 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244721 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227880 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285883 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229014 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246158 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76679 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285884 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258609 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254419 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89309 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244722 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256753 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246842 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76680 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128662 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250113 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244723 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256754 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285885 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229015 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258610 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76681 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254420 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230965 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244724 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232320 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249593 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256755 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229016 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249003 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247864 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254421 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247917 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249594 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256756 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246843 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244608 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227813 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268923 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247676 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227881 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247121 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209774 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252320 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249595 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246159 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246844 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268924 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252321 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227882 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244609 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 268925 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227814 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252322 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246160 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128663 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227883 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209775 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244610 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252323 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258611 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242420 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250114 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247280 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227884 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227815 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89310 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229017 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246161 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128664 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107811 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220558 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76682 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252324 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253853 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89311 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250115 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247865 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254422 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244611 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247918 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249004 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128665 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 252325 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227816 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89312 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232321 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229018 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258612 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246162 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247866 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230966 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250116 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89313 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258613 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249005 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246845 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247867 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247919 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232322 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247677 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227817 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89314 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247122 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247868 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246559 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258614 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230967 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227818 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247920 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128666 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247869 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249596 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258615 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247762 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246846 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255033 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242106 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209776 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227819 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247679 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247870 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232323 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247921 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249597 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 258616 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247871 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230968 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227820 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247281 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209777 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249598 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244612 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247680 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232325 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253854 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242421 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229019 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107812 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247282 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220559 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244613 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246163 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247681 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209778 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249599 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254423 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253855 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250117 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244614 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246164 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247682 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247283 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220560 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107813 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209779 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254424 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128667 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220561 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247683 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253856 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 209780 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 128668 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250118 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246560 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107814 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242422 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247284 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220562 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247123 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247684 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247763 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242107 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220563 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246847 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250119 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247285 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253857 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246561 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242423 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242108 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247764 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107815 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220564 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244725 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255034 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247124 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247286 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249006 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242424 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 250120 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 220565 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242109 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247287 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 244726 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247125 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246848 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232326 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247765 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107816 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242425 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242110 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242426 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247922 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255035 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247766 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107817 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242111 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242427 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232327 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242112 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 107818 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230969 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242113 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230970 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 232328 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246562 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230971 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249007 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230972 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246563 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249008 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247126 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253858 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246564 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249009 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255036 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246565 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249010 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247923 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246566 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247767 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247924 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255037 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247768 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247769 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255038 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255039 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255040 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247127 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247128 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253859 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253860 closing signal SIGTERM WARNING:torch.distributed.elastic.agent.server.api:Received 15 death signal, shutting down workers WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296307 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296308 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296309 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296310 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296311 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296312 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296313 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 296314 closing signal SIGTERM Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent result = agent.run() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper result = f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run result = self._invoke_run(role) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main time.sleep(monitor_interval) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code torch.distributed.elastic.multiprocessing.api.SignalException: Process 253741 got signal: 15 exec(code, run_globals) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent result = agent.run() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code result = f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in 
main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent result = self._invoke_run(role) result = agent.run() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run result = f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run time.sleep(monitor_interval) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 285766 got signal: 15 result = self._invoke_run(role) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run time.sleep(monitor_interval) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 246447 got signal: 15 Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent result = agent.run() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper result = f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run result = self._invoke_run(role) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run time.sleep(monitor_interval) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) 
torch.distributed.elastic.multiprocessing.api.SignalException: Process 209662 got signal: 15 WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12
[default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:PretrainedFromHF
[default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[default0]:using torch.bfloat16 for parameters ...
[default0]:------------------------ arguments ------------------------
[default0]: abort_on_unmet_fused_kernel_constraints ......... True
[default0]: accumulate_allreduce_grads_in_fp32 .............. True
[default0]: adam_beta1 ...................................... 0.9
[default0]: adam_beta2 ...................................... 0.95
[default0]: adam_eps ........................................ 1e-08
[default0]: adlr_autoresume ................................. False
[default0]: adlr_autoresume_interval ........................ 1000
[default0]: apply_query_key_layer_scaling ................... True
[default0]: apply_residual_connection_post_layernorm ........ False
[default0]: attention_dropout ............................... 0.1
[default0]: attention_softmax_in_fp32 ....................... False
[default0]: bert_binary_head ................................ True
[default0]: bert_load ....................................... None
[default0]: bf16 ............................................ True
[default0]: bias_dropout_fusion ............................. True
[default0]: bias_gelu_fusion ................................ True
[default0]: biencoder_projection_dim ........................ 0
[default0]: biencoder_shared_query_context_model ............ False
[default0]: block_data_path ................................. None
[default0]: checkpoint_activations .......................... True
[default0]: checkpoint_in_cpu ............................... False
[default0]: checkpoint_num_layers ........................... 1
[default0]: clip_grad ....................................... 1.0
[default0]: codecarbon_dir .................................. None
[default0]: consumed_train_samples .......................... 0
[default0]: consumed_train_tokens ........................... 0
[default0]: consumed_valid_samples .......................... 0
[default0]: contigious_checkpointing ........................ False
[default0]: cpu_optimizer ................................... False
[default0]: cpu_torch_adam .................................. False
[default0]: curriculum_learning ............................. False
[default0]: data_impl ....................................... mmap
[default0]: data_parallel_size .............................. 8
[default0]: data_path ....................................... None
[default0]: dataloader_type ................................. single
[default0]: DDP_impl ........................................ local
[default0]: decoder_seq_length .............................. None
[default0]: deepscale ....................................... False
[default0]: deepscale_config ................................ None
[default0]: deepspeed ....................................... True
[default0]: deepspeed_activation_checkpointing .............. True
[default0]: deepspeed_config ................................ ./ds_config.202322.json
[default0]: deepspeed_mpi ................................... False
[default0]: distribute_checkpointed_activations ............. False
[default0]: distributed_backend ............................. nccl
[default0]: embed_layernorm ................................. True
[default0]: embedding_path .................................. None
[default0]: encoder_seq_length .............................. 2048
[default0]: eod_mask_loss ................................... False
[default0]: eval_interval ................................... 1000
[default0]: eval_iters ...................................... 10
[default0]: eval_only ....................................... None
[default0]: evidence_data_path .............................. None
[default0]: exit_duration_in_mins ........................... 5990
[default0]: exit_interval ................................... None
[default0]: ffn_hidden_size ................................. 57344
[default0]: finetune ........................................ False
[default0]: fp16 ............................................ False
[default0]: fp16_lm_cross_entropy ........................... False
[default0]: fp32_residual_connection ........................ False
[default0]: gigaflos_no_embeds .............................. 0
[default0]: global_batch_size ............................... 2048
[default0]: glu_activation .................................. None
[default0]: hidden_dropout .................................. 0.1
[default0]: hidden_size ..................................... 14336
[default0]: hysteresis ...................................... 2
[default0]: ict_head_size ................................... None
[default0]: ict_load ........................................ None
[default0]: img_dim ......................................... 224
[default0]: indexer_batch_size .............................. 128
[default0]: indexer_log_interval ............................ 1000
[default0]: init_method_std ................................. 0.0048
[default0]: init_method_xavier_uniform ...................... False
[default0]: initial_loss_scale .............................. 4294967296
[default0]: kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1
[default0]: kv_channels ..................................... 128
[default0]: layernorm_epsilon ............................... 1e-05
[default0]: lazy_mpu_init ................................... None
[default0]: load ............................................ 
/gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: local_rank ...................................... None [default0]: log_batch_size_to_tensorboard ................... True [default0]: log_interval .................................... 1 [default0]: log_learning_rate_to_tensorboard ................ True [default0]: log_level ....................................... None [default0]: log_level_replica ............................... None [default0]: log_loss_scale_to_tensorboard ................... True [default0]: log_num_zeros_in_grad ........................... False [default0]: log_params_norm ................................. False [default0]: log_path ........................................ None [default0]: log_timers_to_tensorboard ....................... True [default0]: log_validation_ppl_to_tensorboard ............... True [default0]: loss_on_targets_only ............................ False [default0]: loss_scale ...................................... None [default0]: loss_scale_window ............................... 1000 [default0]: lr .............................................. 6e-05 [default0]: lr_decay_iters .................................. None [default0]: lr_decay_samples ................................ 200000000 [default0]: lr_decay_style .................................. cosine [default0]: lr_decay_tokens ................................. None [default0]: lr_warmup_fraction .............................. None [default0]: lr_warmup_iters ................................. 0 [default0]: lr_warmup_samples ............................... 183105 [default0]: make_vocab_size_divisible_by .................... 128 [default0]: mask_prob ....................................... 0.15 [default0]: masked_softmax_fusion ........................... True [default0]: max_position_embeddings ......................... 2048 [default0]: memory_centric_tiled_linear ..................... 
False [default0]: merge_file ...................................... None [default0]: micro_batch_size ................................ 2 [default0]: min_loss_scale .................................. 1.0 [default0]: min_lr .......................................... 6e-06 [default0]: mmap_warmup ..................................... False [default0]: no_load_optim ................................... None [default0]: no_load_rng ..................................... None [default0]: no_save_optim ................................... None [default0]: no_save_rng ..................................... None [default0]: num_attention_heads ............................. 112 [default0]: num_channels .................................... 3 [default0]: num_classes ..................................... 1000 [default0]: num_layers ...................................... 70 [default0]: num_layers_per_virtual_pipeline_stage ........... None [default0]: num_workers ..................................... 2 [default0]: onnx_safe ....................................... None [default0]: openai_gelu ..................................... False [default0]: optimizer ....................................... adam [default0]: override_lr_scheduler ........................... False [default0]: pad_vocab_size_to ............................... 250880 [default0]: params_dtype .................................... torch.bfloat16 [default0]: partition_activations ........................... False [default0]: patch_dim ....................................... 16 [default0]: pipeline_model_parallel_size .................... 12 [default0]: position_embedding_type ......................... PositionEmbeddingType.alibi [default0]: pp_partition_method ............................. type:transformer|embedding [default0]: profile_backward ................................ False [default0]: query_in_block_prob ............................. 0.1 [default0]: rampup_batch_size ............................... 
['16', '16', '9_765_625'] [default0]: rank ............................................ 0 [default0]: remote_device ................................... none [default0]: reset_attention_mask ............................ False [default0]: reset_position_ids .............................. False [default0]: retriever_report_topk_accuracies ................ [] [default0]: retriever_score_scaling ......................... False [default0]: retriever_seq_length ............................ 256 [default0]: reweight_loss_based_on_position_frequency ....... False [default0]: sample_rate ..................................... 1.0 [default0]: save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: save_interval ................................... 500 [default0]: scatter_gather_tensors_in_pipeline .............. True [default0]: scattered_embeddings ............................ False [default0]: seed ............................................ 42 [default0]: seq_length ...................................... 2048 [default0]: sgd_momentum .................................... 0.9 [default0]: short_seq_prob .................................. 0.1 [default0]: skip_train_iteration_range ...................... None [default0]: split ........................................... None [default0]: split_transformers .............................. False [default0]: synchronize_each_layer .......................... False [default0]: tensor_model_parallel_size ...................... 4 [default0]: tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard [default0]: tensorboard_log_interval ........................ 1 [default0]: tensorboard_queue_size .......................... 5 [default0]: test_weighted_split_names ....................... ['test'] [default0]: test_weighted_split_paths ....................... 
[['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: test_weighted_split_paths_path .................. None [default0]: test_weighted_split_splits ...................... [['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']] [default0]: test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: tile_factor ..................................... 1 [default0]: titles_data_path ................................ None [default0]: tokenizer_name_or_path .......................... 
bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k [default0]: tokenizer_type .................................. PretrainedFromHF [default0]: train_iters ..................................... None [default0]: train_samples ................................... 220000000 [default0]: train_tokens .................................... None [default0]: train_weighted_split_names ...................... ['train'] [default0]: train_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: train_weighted_split_paths_path ................. None [default0]: train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']] [default0]: train_weighted_split_weights .................... 
[['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: use_bnb_optimizer ............................... False [default0]: use_checkpoint_lr_scheduler ..................... False [default0]: use_contiguous_buffers_in_ddp ................... True [default0]: use_cpu_initialization .......................... None [default0]: use_one_sent_docs ............................... False [default0]: use_pin_memory .................................. False [default0]: valid_weighted_split_names ...................... ['valid'] [default0]: valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: valid_weighted_split_paths_path ................. None [default0]: valid_weighted_split_splits ..................... 
[['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']] [default0]: valid_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: virtual_pipeline_model_parallel_size ............ None [default0]: vocab_extra_ids ................................. 0 [default0]: vocab_file ...................................... None [default0]: weight_decay .................................... 0.1 [default0]: world_size ...................................... 384 [default0]: zero_allgather_bucket_size ...................... 0.0 [default0]: zero_contigious_gradients ....................... False [default0]: zero_reduce_bucket_size ......................... 0.0 [default0]: zero_reduce_scatter ............................. False [default0]: zero_stage ...................................... 0 [default0]:-------------------- end of arguments --------------------- [default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples. [default0]:> building PretrainedFromHF tokenizer ... [default0]: vocab file is un-used. loading tokenizer from pre-trained model [default0]:Offline mode: forcing local_files_only=True [default0]:Offline mode: forcing local_files_only=True [default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate. 
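Note on the batch size rampup reported above (start 16, increment 16, target 2048, over 9765625 samples): a minimal sketch of how such a stepwise schedule could be computed. The function name and the even spacing of the steps over the rampup window are assumptions for illustration, not taken from this log.

```python
def rampup_global_batch_size(consumed_samples, start=16, increment=16,
                             rampup_samples=9_765_625, target=2048):
    """Global batch size after `consumed_samples` under a stepwise linear ramp.

    Assumption: the (target - start) / increment steps are spread evenly
    over `rampup_samples` consumed samples.
    """
    if consumed_samples >= rampup_samples:
        return target
    num_increments = (target - start) // increment        # 127 steps here
    samples_per_increment = rampup_samples / num_increments
    steps = int(consumed_samples / samples_per_increment)
    return min(start + steps * increment, target)


if __name__ == "__main__":
    for consumed in (0, 76_895, 5_000_000, 10_000_000):
        print(consumed, rampup_global_batch_size(consumed))
```

Under these assumptions each step lasts roughly 76,895 samples, so the run would reach the full global batch size of 2048 only after the 9765625-sample window has been consumed.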
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e
[default7]:> setting tensorboard ...
[default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880)
[default0]:DeepSpeed general environment info:
[default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch']
[default0]:torch version .................... 1.11.0+cu115
[default0]:torch cuda version ............... 11.5
[default0]:nvcc version ..................... 11.4
[default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed']
[default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates
[default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
[default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm ****
[default0]:> initializing torch distributed ...
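Note on the "padded vocab" record above: the reported sizes (250680 padded with 200 dummy tokens to 250880) are consistent with rounding the tokenizer vocabulary up to a multiple of make_vocab_size_divisible_by (128) times the tensor-model-parallel size (4). The sketch below illustrates that arithmetic only; the function name is hypothetical, and the run also sets pad_vocab_size_to 250880 explicitly, which yields the same value.

```python
def padded_vocab_size(vocab_size, divisible_by=128, tensor_parallel_size=4):
    """Round `vocab_size` up to a multiple of divisible_by * tensor_parallel_size.

    Assumption: padding keeps the embedding table evenly divisible across
    tensor-parallel ranks; the defaults mirror the arguments in this log.
    """
    multiple = divisible_by * tensor_parallel_size   # 512 for this run
    return -(-vocab_size // multiple) * multiple     # integer ceiling division


if __name__ == "__main__":
    original = 250_680
    padded = padded_vocab_size(original)
    print(padded, padded - original)
```

With the values from this run the padded size is 250880, that is, 200 dummy tokens, matching the record in the log.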
[default0]:> initializing tensor model parallel with size 4
[default0]:> initializing pipeline model parallel with size 12
[default0]:> setting random seeds to 42 ...
[default0]:[2022-03-04 04:02:43,890] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42
[default0]:> compiling dataset index builder ...
[default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:make: Nothing to be done for 'default'.
[default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data'
[default0]:>>> done with dataset index builder. Compilation time: 0.103 seconds
[default0]:> compiling and loading fused kernels ...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags
[default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
[default0]:Building extension module scaled_masked_softmax_cuda...
[default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[default0]:ninja: no work to do.
[default0]:Loading extension module scaled_masked_softmax_cuda...
[default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module fused_mix_prec_layer_norm_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module fused_mix_prec_layer_norm_cuda... [default0]:>>> done with compiling and loading fused kernels. Compilation time: 9.563 seconds [default0]:time to initialize megatron (seconds): 93.559 [default0]:[after megatron is initialized] datetime: 2022-03-04 04:02:53 [default0]:building GPT model ... [default0]:[2022-03-04 04:02:53,586] [INFO] [utils.py:828:see_memory_usage] Before Building Model [default0]:[2022-03-04 04:02:53,587] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB [default0]:[2022-03-04 04:02:53,587] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.25 GB, percent = 8.6% [default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None [default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 
18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, 
ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, 
ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, 
ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, 
ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, 
ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, 
ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, 
ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, 
ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383} [default0]:[2022-03-04 04:02:55,582] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding [default0]:stage=0 layers=8 [default0]: 0: _to_float16 [default0]: 1: EmbeddingPipe [default0]: 2: <lambda> [default0]: 3: ParallelTransformerLayerPipe [default0]: 4: ParallelTransformerLayerPipe [default0]: 5: ParallelTransformerLayerPipe [default0]: 6: ParallelTransformerLayerPipe [default0]: 7: ParallelTransformerLayerPipe [default0]:stage=1 layers=6 [default0]: 8: ParallelTransformerLayerPipe [default0]: 9: ParallelTransformerLayerPipe [default0]: 10: ParallelTransformerLayerPipe [default0]: 11: ParallelTransformerLayerPipe [default0]: 12: ParallelTransformerLayerPipe [default0]: 13: ParallelTransformerLayerPipe [default0]:stage=2 layers=6 [default0]: 14: ParallelTransformerLayerPipe [default0]: 15: ParallelTransformerLayerPipe [default0]: 16: ParallelTransformerLayerPipe [default0]: 17: ParallelTransformerLayerPipe [default0]: 18: ParallelTransformerLayerPipe [default0]: 19: ParallelTransformerLayerPipe [default0]:stage=3 layers=6 [default0]: 20: ParallelTransformerLayerPipe [default0]: 21: ParallelTransformerLayerPipe [default0]: 22: ParallelTransformerLayerPipe [default0]: 23: ParallelTransformerLayerPipe [default0]: 24: ParallelTransformerLayerPipe [default0]: 25: ParallelTransformerLayerPipe [default0]:stage=4 layers=6 [default0]: 26: ParallelTransformerLayerPipe [default0]: 27: ParallelTransformerLayerPipe [default0]: 28: ParallelTransformerLayerPipe [default0]: 29: ParallelTransformerLayerPipe [default0]: 30: ParallelTransformerLayerPipe [default0]: 31: ParallelTransformerLayerPipe [default0]:stage=5 layers=6 [default0]: 32: ParallelTransformerLayerPipe [default0]: 33: ParallelTransformerLayerPipe [default0]: 34: 
ParallelTransformerLayerPipe [default0]: 35: ParallelTransformerLayerPipe [default0]: 36: ParallelTransformerLayerPipe [default0]: 37: ParallelTransformerLayerPipe [default0]:stage=6 layers=6 [default0]: 38: ParallelTransformerLayerPipe [default0]: 39: ParallelTransformerLayerPipe [default0]: 40: ParallelTransformerLayerPipe [default0]: 41: ParallelTransformerLayerPipe [default0]: 42: ParallelTransformerLayerPipe [default0]: 43: ParallelTransformerLayerPipe [default0]:stage=7 layers=6 [default0]: 44: ParallelTransformerLayerPipe [default0]: 45: ParallelTransformerLayerPipe [default0]: 46: ParallelTransformerLayerPipe [default0]: 47: ParallelTransformerLayerPipe [default0]: 48: ParallelTransformerLayerPipe [default0]: 49: ParallelTransformerLayerPipe [default0]:stage=8 layers=6 [default0]: 50: ParallelTransformerLayerPipe [default0]: 51: ParallelTransformerLayerPipe [default0]: 52: ParallelTransformerLayerPipe [default0]: 53: ParallelTransformerLayerPipe [default0]: 54: ParallelTransformerLayerPipe [default0]: 55: ParallelTransformerLayerPipe [default0]:stage=9 layers=6 [default0]: 56: ParallelTransformerLayerPipe [default0]: 57: ParallelTransformerLayerPipe [default0]: 58: ParallelTransformerLayerPipe [default0]: 59: ParallelTransformerLayerPipe [default0]: 60: ParallelTransformerLayerPipe [default0]: 61: ParallelTransformerLayerPipe [default0]:stage=10 layers=6 [default0]: 62: ParallelTransformerLayerPipe [default0]: 63: ParallelTransformerLayerPipe [default0]: 64: ParallelTransformerLayerPipe [default0]: 65: ParallelTransformerLayerPipe [default0]: 66: ParallelTransformerLayerPipe [default0]: 67: ParallelTransformerLayerPipe [default0]:stage=11 layers=9 [default0]: 68: ParallelTransformerLayerPipe [default0]: 69: ParallelTransformerLayerPipe [default0]: 70: ParallelTransformerLayerPipe [default0]: 71: ParallelTransformerLayerPipe [default0]: 72: ParallelTransformerLayerPipe [default0]: 73: <lambda> [default0]: 74: MixedFusedLayerNorm [default0]: 75: EmbeddingPipe 
[default0]: 76: float16_to_fp32 [default0]: loss: CrossEntropy [default0]:[2022-03-04 04:02:56,733] [INFO] [utils.py:828:see_memory_usage] After Building Model [default0]:[2022-03-04 04:02:56,734] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-04 04:02:56,734] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.65 GB, percent = 8.7% [default0]:setting training iterations to 128728 [default0]:> learning rate decay style: cosine [default0]:DeepSpeed is enabled. [default0]:[2022-03-04 04:02:56,755] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates [default0]:[2022-03-04 04:02:58,559] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False [default0]:[2022-03-04 04:02:58,559] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer [default0]:[2022-03-04 04:02:58,559] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer [default0]:[2022-03-04 04:02:58,560] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam [default0]:[2022-03-04 04:02:58,560] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer [default0]:[2022-03-04 04:02:58,619] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer [default0]:[2022-03-04 04:02:58,620] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-04 04:02:58,620] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,646] [INFO] [utils.py:828:see_memory_usage] before initializing group 0 [default0]:[2022-03-04 04:02:58,646] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.42 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-04 04:02:58,646] [INFO] [utils.py:837:see_memory_usage] CPU Virtual 
Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,701] [INFO] [utils.py:828:see_memory_usage] after initializing group 0 [default0]:[2022-03-04 04:02:58,702] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-04 04:02:58,702] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,728] [INFO] [utils.py:828:see_memory_usage] before initializing group 1 [default0]:[2022-03-04 04:02:58,728] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-04 04:02:58,729] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,775] [INFO] [utils.py:828:see_memory_usage] after initializing group 1 [default0]:[2022-03-04 04:02:58,775] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:02:58,775] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,799] [INFO] [utils.py:828:see_memory_usage] before initializing group 2 [default0]:[2022-03-04 04:02:58,799] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:02:58,799] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,824] [INFO] [utils.py:828:see_memory_usage] after initializing group 2 [default0]:[2022-03-04 04:02:58,824] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:02:58,824] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,847] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer [default0]:[2022-03-04 04:02:58,848] [INFO] 
[utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:02:58,848] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,898] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer [default0]:[2022-03-04 04:02:58,898] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-04 04:02:58,898] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,920] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer [default0]:[2022-03-04 04:02:58,920] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-04 04:02:58,920] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 44.0 GB, percent = 8.7% [default0]:[2022-03-04 04:02:58,920] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [default0]:[2022-03-04 04:02:58,921] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler [default0]:[2022-03-04 04:02:58,921] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x14b8b4aa15b0> [default0]:[2022-03-04 04:02:58,921] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)] [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1057:print] DeepSpeedEngine configuration: [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] activation_checkpointing_config { [default0]: "partition_activations": false, [default0]: "contiguous_memory_optimization": false, [default0]: "cpu_checkpointing": false, [default0]: "number_checkpoints": null, [default0]: "synchronize_checkpoint_boundary": false, [default0]: "profile": false [default0]:} [default0]:[2022-03-04 04:02:58,921] [INFO] 
[config.py:1061:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] amp_enabled .................. False [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] amp_params ................... False [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] autotuning_config ............ { [default0]: "enabled": false, [default0]: "start_step": null, [default0]: "end_step": null, [default0]: "metric_path": null, [default0]: "arg_mappings": null, [default0]: "metric": "throughput", [default0]: "model_info": null, [default0]: "results_dir": null, [default0]: "exps_dir": null, [default0]: "overwrite": true, [default0]: "fast": true, [default0]: "start_profile_step": 3, [default0]: "end_profile_step": 5, [default0]: "tuner_type": "gridsearch", [default0]: "tuner_early_stopping": 5, [default0]: "tuner_num_trials": 50, [default0]: "model_info_path": null, [default0]: "mp_size": 1, [default0]: "max_train_batch_size": null, [default0]: "min_train_batch_size": 1, [default0]: "max_train_micro_batch_size_per_gpu": 1.024000e+03, [default0]: "min_train_micro_batch_size_per_gpu": 1, [default0]: "num_tuning_micro_batch_sizes": 3 [default0]:} [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] bfloat16_enabled ............. True [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] checkpoint_tag_validation_enabled True [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] checkpoint_tag_validation_fail False [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] communication_data_type ...... None [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] curriculum_enabled ........... False [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] curriculum_params ............ 
False [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] dataloader_drop_last ......... False [default0]:[2022-03-04 04:02:58,921] [INFO] [config.py:1061:print] disable_allgather ............ False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] dump_state ................... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] dynamic_loss_scale_args ...... None [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_enabled ........... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_gas_boundary_resolution 1 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_layer_name ........ bert.encoder.layer [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_layer_num ......... 0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_max_iter .......... 100 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_stability ......... 1e-06 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_tol ............... 0.01 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] eigenvalue_verbose ........... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] elasticity_enabled ........... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] flops_profiler_config ........ { [default0]: "enabled": false, [default0]: "profile_step": 1, [default0]: "module_depth": -1, [default0]: "top_modules": 1, [default0]: "detailed": true, [default0]: "output_file": null [default0]:} [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] fp16_enabled ................. False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] fp16_master_weights_and_gradients False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] fp16_mixed_quantize .......... 
False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] global_rank .................. 0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] gradient_accumulation_steps .. 128 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] gradient_clipping ............ 1.0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] gradient_predivide_factor .... 1.0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] initial_dynamic_scale ........ 1 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] loss_scale ................... 1.0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] memory_breakdown ............. False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] optimizer_legacy_fusion ...... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] optimizer_name ............... None [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] optimizer_params ............. None [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] pld_enabled .................. False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] pld_params ................... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] prescale_gradients ........... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_change_rate ......... 0.001 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_groups .............. 1 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_offset .............. 1000 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_period .............. 
1000 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_rounding ............ 0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_start_bits .......... 16 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_target_bits ......... 8 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_training_enabled .... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_type ................ 0 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] quantize_verbose ............. False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] scheduler_name ............... None [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] scheduler_params ............. None [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] sparse_attention ............. None [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] sparse_gradients_enabled ..... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] steps_per_print .............. 2000 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] tensorboard_enabled .......... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] tensorboard_job_name ......... DeepSpeedJobName [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] tensorboard_output_path ...... [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] train_batch_size ............. 2048 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] train_micro_batch_size_per_gpu 2 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] use_quantizer_kernel ......... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] wall_clock_breakdown ......... False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] world_size ................... 
8 [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] zero_allow_untested_optimizer False [default0]:[2022-03-04 04:02:58,922] [INFO] [config.py:1061:print] zero_config .................. { [default0]: "stage": 0, [default0]: "contiguous_gradients": true, [default0]: "reduce_scatter": true, [default0]: "reduce_bucket_size": 5.000000e+08, [default0]: "allgather_partitions": true, [default0]: "allgather_bucket_size": 5.000000e+08, [default0]: "overlap_comm": false, [default0]: "load_from_fp32_weights": true, [default0]: "elastic_checkpoint": false, [default0]: "offload_param": null, [default0]: "offload_optimizer": null, [default0]: "sub_group_size": 1.000000e+09, [default0]: "prefetch_bucket_size": 5.000000e+07, [default0]: "param_persistence_threshold": 1.000000e+05, [default0]: "max_live_parameters": 1.000000e+09, [default0]: "max_reuse_distance": 1.000000e+09, [default0]: "gather_16bit_weights_on_model_save": false, [default0]: "ignore_unused_parameters": true, [default0]: "round_robin_gradients": false, [default0]: "legacy_stage1": false [default0]:} [default0]:[2022-03-04 04:02:58,923] [INFO] [config.py:1061:print] zero_enabled ................. False [default0]:[2022-03-04 04:02:58,923] [INFO] [config.py:1061:print] zero_optimization_stage ...... 
0 [default0]:[2022-03-04 04:02:58,923] [INFO] [config.py:1063:print] json = { [default0]: "train_micro_batch_size_per_gpu": 2, [default0]: "train_batch_size": 2.048000e+03, [default0]: "gradient_clipping": 1.0, [default0]: "zero_optimization": { [default0]: "stage": 0 [default0]: }, [default0]: "bf16": { [default0]: "enabled": true [default0]: }, [default0]: "steps_per_print": 2.000000e+03, [default0]: "wall_clock_breakdown": false [default0]:} [default0]:[2022-03-04 04:02:58,923] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2 [default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 
(3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 
(3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 
(3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,195] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:03:00,196] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]: > using checkpoint value 6e-05 for learning rate [default0]: > using checkpoint value 6e-06 for minimum learning rate [default0]: > using checkpoint value 183105 for warmup iterations [default0]: > using checkpoint value 200000000 for total number of iterations [default0]: > using checkpoint value cosine for decay style [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]:[2022-03-04 04:03:13,219] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 276 [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:13,438] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 272 [default0]:Traceback (most recent call last): [default0]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:13,878] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 277 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:14,053] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 356 [default0]:[2022-03-04 04:03:14,141] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 352 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:14,487] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 279 [default1]:[2022-03-04 04:03:14,986] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 273 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:14,955] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 274 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:15,080] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 136 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:15,221] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 336 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:15,287] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 140 [default3]:[2022-03-04 04:03:15,253] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 275 [default0]:[2022-03-04 04:03:15,499] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 328 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in 
wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper 
[default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:15,563] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 278 WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247065 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247066 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247067 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247068 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247070 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246692 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246693 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246694 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246695 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247071 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246698 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 246699 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 247072 closing signal SIGTERM [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:15,830] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 340 [default2]:[2022-03-04 04:03:16,518] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 138 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:16,438] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 252 [default4]:[2022-03-04 04:03:16,479] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 4 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:16,688] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 200 [default2]:[2022-03-04 04:03:16,668] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 314 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:16,638] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 308 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:16,828] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 348 [default0]:[2022-03-04 04:03:16,969] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 312 [default0]:[2022-03-04 04:03:16,951] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 120 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) 
[default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:17,091] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 196 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:17,045] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 344 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:17,078] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 305 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:17,186] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 248 [default4]:[2022-03-04 04:03:17,207] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 156 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:17,298] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 184 [default4]:[2022-03-04 04:03:17,279] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 124 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:17,247] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 122 [default7]:[2022-03-04 04:03:17,291] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 343 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: 
self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:17,323] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 339 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 246696) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default2]:[2022-03-04 04:03:17,384] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 250 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint 
[default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:17,445] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 36 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]:[2022-03-04 04:03:17,573] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 141 [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:17,539] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 47 [default2]:[2022-03-04 04:03:17,531] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 170 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:17,530] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 152 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:17,627] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 176 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:17,701] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 168 [default5]:[2022-03-04 04:03:17,705] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 125 [default0]:Traceback (most recent call last): [default0]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:17,781] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 249 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]:[2022-03-04 04:03:17,748] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 306 [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:17,877] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 331 [default0]:[2022-03-04 04:03:17,874] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 280 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint 
[default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: 
success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]:[2022-03-04 04:03:17,913] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 351 [default3]:[2022-03-04 04:03:17,922] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 315 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:17,872] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 253 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:17,849] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 121 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:17,986] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 324 [default4]:[2022-03-04 04:03:17,945] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 372 [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:17,933] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 173 [default4]:[2022-03-04 04:03:18,048] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 300 [default4]:Traceback (most recent call last): [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:18,119] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 345 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: 
self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,115] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 304 [default0]:[2022-03-04 04:03:18,174] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 296 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 247069) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,202] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 160 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) 
[default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:18,150] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 164 [default0]:[2022-03-04 04:03:18,230] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 320 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) 
[default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:18,258] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 137 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:18,322] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 334 [default4]:[2022-03-04 04:03:18,389] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 236 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:18,375] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 244 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, 
**kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:18,327] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 332 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:18,364] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 319 [default5]:[2022-03-04 04:03:18,426] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 349 [default1]:[2022-03-04 04:03:18,465] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 329 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:18,475] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 347 [default7]:[2022-03-04 04:03:18,437] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 335 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint 
[default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: 
success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = 
self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = 
self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:18,466] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 310 [default3]:[2022-03-04 04:03:18,460] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 307 [default3]:[2022-03-04 04:03:18,485] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 251 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:18,493] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 188 [default4]:[2022-03-04 04:03:18,522] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 172 [default3]:[2022-03-04 04:03:18,525] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 171 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:18,462] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 161 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,502] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 0 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:18,519] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 143 [default1]:[2022-03-04 04:03:18,582] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 241 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:18,596] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 142 [default4]:[2022-03-04 04:03:18,551] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 76 [default1]:[2022-03-04 04:03:18,568] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 313 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint 
[default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:18,567] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 255 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:18,630] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 338 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: 
loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:18,626] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 322 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]:[2022-03-04 04:03:18,704] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 321 [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:18,636] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 175 [default4]:Traceback (most recent call last): [default4]:[2022-03-04 04:03:18,708] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 108 [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:18,629] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 201 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, 
**kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,669] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 368 [default3]:[2022-03-04 04:03:18,665] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 203 [default0]:Traceback (most recent call last): [default0]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:18,789] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 194 [default2]:[2022-03-04 04:03:18,784] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 330 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:18,794] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 285 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:18,759] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 174 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,902] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 32 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,891] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 72 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:18,867] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 205 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]:[2022-03-04 04:03:18,860] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 204 [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]:[2022-03-04 04:03:18,830] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 104 [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:18,910] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 264 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:18,922] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 326 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,016] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 254 [default7]:Traceback (most recent call last): [default7]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:18,949] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 127 [default6]:[2022-03-04 04:03:19,054] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 318 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]:[2022-03-04 04:03:19,072] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 144 [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]:[2022-03-04 04:03:19,061] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 284 [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default5]:[2022-03-04 04:03:19,064] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 317 [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: 
loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,118] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 342 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:19,185] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 37 [default3]:[2022-03-04 04:03:19,131] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 139 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:19,148] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 191 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]:[2022-03-04 04:03:19,186] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 341 [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: 
args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:19,317] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 178 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]:[2022-03-04 04:03:19,305] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 208 [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]:[2022-03-04 04:03:19,263] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO 
state_dicts for rank 40 [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:19,234] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 24 [default1]:[2022-03-04 04:03:19,323] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 281 [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:19,331] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 28 [default4]:[2022-03-04 04:03:19,425] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 44 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:19,408] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 202 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) 
[default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,401] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 126 [default5]:[2022-03-04 04:03:19,412] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 53 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, 
lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:19,435] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 325 [default3]:[2022-03-04 04:03:19,418] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 323 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, 
optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:19,466] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 192 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,427] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 198 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]:Traceback (most recent call last): [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: main() [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) 
[default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,445] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 350 [default1]:[2022-03-04 04:03:19,484] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 169 [default4]:Traceback (most recent call last): [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:19,524] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 333 [default6]:[2022-03-04 04:03:19,507] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 158 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,435] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 206 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:19,593] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 327 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:19,538] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 346 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in 
load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:19,611] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 207 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default5]:[2022-03-04 04:03:19,571] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 189 [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:19,594] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 309 [default2]:[2022-03-04 04:03:19,591] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 154 [default4]:[2022-03-04 04:03:19,560] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 316 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", 
line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:19,606] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 167 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper 
[default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: 
return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, 
**kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:19,592] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 337 [default0]:[2022-03-04 04:03:19,714] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 360 [default3]:[2022-03-04 04:03:19,655] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] 
successfully read 8 ZeRO state_dicts for rank 179 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] 
[default3]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] 
[default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:19,653] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 33 [default3]:[2022-03-04 04:03:19,702] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 283 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in 
_load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in 
_load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:19,678] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 187 [default7]:[2022-03-04 04:03:19,635] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 119 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:19,712] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 50 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:19,775] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 16 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:19,789] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 232 [default7]:[2022-03-04 04:03:19,757] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 303 [default0]:[2022-03-04 04:03:19,786] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 376 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:19,769] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 268 [default3]:[2022-03-04 04:03:19,798] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 123 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,891] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 214 [default7]:[2022-03-04 04:03:19,881] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 199 [default5]:[2022-03-04 04:03:19,875] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 245 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default3]:[2022-03-04 
04:03:19,836] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 43 [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:19,869] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 117 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]:[2022-03-04 04:03:19,838] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 287 [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) 
[default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]:[2022-03-04 04:03:19,838] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 159 [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:19,848] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 157 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]:[2022-03-04 04:03:19,930] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 49 [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]:[2022-03-04 04:03:19,989] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 180 [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:19,929] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 302 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, 
forward_step, [default1]:[2022-03-04 04:03:19,972] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 193 [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default1]:[2022-03-04 04:03:19,965] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 41 [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]:[2022-03-04 04:03:19,939] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 38 [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:19,966] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 155 
[default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]:[2022-03-04 04:03:19,967] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 373 [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default2]:[2022-03-04 04:03:20,064] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 18 [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: 
self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:20,069] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 197 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default0]:[2022-03-04 04:03:20,082] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 240 [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]:[2022-03-04 04:03:20,085] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 20 [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default5]:[2022-03-04 04:03:20,074] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 45 [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:20,083] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 292 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]:[2022-03-04 04:03:20,114] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 34 [default1]: return f(*args, **kwargs) [default5]:[2022-03-04 04:03:20,045] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 165 [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]:[2022-03-04 04:03:20,116] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 12 [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:20,092] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 162 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]:[2022-03-04 04:03:20,216] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 182 [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]:[2022-03-04 04:03:20,216] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 183 [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default0]:[2022-03-04 04:03:20,309] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 80 [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:[2022-03-04 04:03:20,307] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 21 [default6]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:20,326] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 114 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]:[2022-03-04 04:03:20,314] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 380 [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]:[2022-03-04 04:03:20,264] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 186 [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]:[2022-03-04 04:03:20,252] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 10 [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, 
optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default7]:[2022-03-04 04:03:20,254] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 311 [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:20,396] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 235 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]:[2022-03-04 04:03:20,409] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 195 [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 
[default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]:[2022-03-04 04:03:20,341] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 46 [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:20,378] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 60 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]:[2022-03-04 04:03:20,388] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 286 [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default5]:[2022-03-04 04:03:20,346] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 109 [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:20,375] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 105 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]:[2022-03-04 04:03:20,372] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 260 [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:[2022-03-04 04:03:20,341] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 163 [default1]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:20,418] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 54 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default1]:[2022-03-04 04:03:20,464] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 297 [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]:[2022-03-04 04:03:20,429] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 299 [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default1]:[2022-03-04 04:03:20,450] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 177 [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:20,471] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 84 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]:[2022-03-04 04:03:20,472] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 116 [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:[2022-03-04 04:03:20,488] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 98 [default5]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:20,531] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 153 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() 
[default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]:[2022-03-04 04:03:20,504] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 223 [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): 
[default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default6]:Traceback (most recent 
call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]:[2022-03-04 04:03:20,525] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 211 [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: 
File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = 
current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = 
current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = 
current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = 
current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:20,574] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 39 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]:[2022-03-04 04:03:20,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 42 [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default1]:[2022-03-04 04:03:20,550] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 113 [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:20,597] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 26 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]:[2022-03-04 04:03:20,572] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 30 [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:[2022-03-04 04:03:20,605] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 282 [default2]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:20,582] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 106 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]:[2022-03-04 04:03:20,588] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 190 [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default5]:[2022-03-04 04:03:20,585] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 301 [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]:[2022-03-04 04:03:20,559] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 166 [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in 
wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper 
[default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: 
return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, 
**kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:20,628] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 55 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in 
pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]:Traceback (most recent call last): 
[default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: success = self._load_zero_checkpoint( [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]:KeyError: 'clip_grad' [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:20,636] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 181 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]:[2022-03-04 04:03:20,674] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 73 [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default3]:[2022-03-04 04:03:20,656] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 35 [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:20,703] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 111 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]:[2022-03-04 04:03:20,679] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 185 [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default4]:[2022-03-04 04:03:20,684] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 52 [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: 
loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, 
state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254451 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254454 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245945 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245946 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287078 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287079 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287080 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245948 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287081 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287082 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245950 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287083 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 287084 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245951 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245952 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245863 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245867 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 245868 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253186 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253187 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253188 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253189 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253191 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253192 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253193 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248196 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248197 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248198 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248199 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248200 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248202 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248203 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 230217 closing signal SIGTERM
[default2]:[2022-03-04 04:03:20,776] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 298
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259744 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249069 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259745 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249070 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259746 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259747 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249071 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259748 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259749 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249072 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 259750 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242644 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242646 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249074 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249075 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 249076 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242647 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 242648 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231484 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231485 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231486 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231487 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231489 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231490 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231491 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 210933 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 210934 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229013 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229014 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229015 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229017 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229018 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 229019 closing signal SIGTERM [default2]:Traceback (most recent call last): [default2]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:20,729] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 375 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]:[2022-03-04 04:03:20,794] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 371 [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:20,820] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 107 WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297992 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297993 closing signal SIGTERM [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]:[2022-03-04 04:03:20,805] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 110 WARNING:torch.distributed.elastic.multiprocessing.api:Sending 
process 297994 closing signal SIGTERM [default6]:[2022-03-04 04:03:20,787] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 118 [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297996 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297997 closing signal SIGTERM [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: 
File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 297998 closing signal SIGTERM [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: 
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, 
optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 257949 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 257952 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248427 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248431 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 108419 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255609 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255610 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending 
process 255612 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255614 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 255616 closing signal SIGTERM [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:[2022-03-04 04:03:20,858] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 364 [default4]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:20,878] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 77 [default3]:[2022-03-04 04:03:20,878] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 19 [default7]:[2022-03-04 04:03:20,840] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 79 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 250780) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:20,899] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 270 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:20,883] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 263 [default0]:[2022-03-04 04:03:20,920] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 8 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:20,905] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 271 WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228356 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228359 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228361 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 228363 closing 
signal SIGTERM [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248343 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 248347 closing signal SIGTERM [default7]:[2022-03-04 04:03:20,959] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 239 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", 
line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in 
_load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:20,977] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 233 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( 
[default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,002] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 213 [default2]:[2022-03-04 04:03:20,947] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 210 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]:[2022-03-04 04:03:20,943] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 148 [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]:[2022-03-04 04:03:20,961] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 31 [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:20,974] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 25 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:20,959] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 17 [default3]:[2022-03-04 04:03:20,988] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 267 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, 
in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,001] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 294 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: 
return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:20,998] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 9 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,014] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 13 [default6]:[2022-03-04 04:03:20,982] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 382 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: 
self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:20,971] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 220 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:20,999] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 48 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,087] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 237 [default1]:[2022-03-04 04:03:21,081] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 89 [default6]:[2022-03-04 04:03:21,090] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 22 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]:Traceback (most recent call last): [default1]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = 
self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,084] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 62 [default1]:Traceback (most recent call last): [default4]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]:[2022-03-04 04:03:21,103] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 215 [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, 
lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:21,070] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 81 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,109] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 75 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:21,100] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 112 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]:[2022-03-04 04:03:21,079] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 212 [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,036] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 115 [default4]:[2022-03-04 04:03:21,081] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 92 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,079] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 15 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,116] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 295 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]:[2022-03-04 04:03:21,060] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 51 [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,179] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 102 [default1]:[2022-03-04 04:03:21,195] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 361 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:21,212] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 58 [default2]:[2022-03-04 04:03:21,141] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 82 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,209] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 27 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, 
**kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,139] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 219 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,203] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 383 [default2]:[2022-03-04 04:03:21,148] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 74 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: 
self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,175] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 259 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:21,165] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 377 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,254] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 363 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,243] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 150 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 230214) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, 
lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,264] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 11 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]:[2022-03-04 04:03:21,317] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 269 [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,322] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 93 [default0]:[2022-03-04 04:03:21,371] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 56 [default6]:[2022-03-04 04:03:21,324] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 366 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() 
[default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,347] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 23 [default7]:Traceback (most recent call last): 
[default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,359] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 63 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:21,431] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 265 [default0]:[2022-03-04 04:03:21,391] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 288 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,403] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 262 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:[2022-03-04 04:03:21,432] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 379 [default3]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:21,376] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 216 [default0]:[2022-03-04 04:03:21,479] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 88 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:21,492] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 57 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]:[2022-03-04 04:03:21,432] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 362 [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]:[2022-03-04 04:03:21,437] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 365 [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default5]:Traceback (most recent call last): [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]:KeyError: 'clip_grad' [default5]: return f(*args, 
**kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) 
[default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 254449) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default1]:[2022-03-04 04:03:21,519] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 209 [default2]:[2022-03-04 04:03:21,514] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully 
read 8 ZeRO state_dicts for rank 266 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 108415) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,511] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 293 [default6]:[2022-03-04 04:03:21,512] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 14 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint 
[default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]:[2022-03-04 04:03:21,497] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 258 [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]:[2022-03-04 04:03:21,460] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 378 [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: 
loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:21,519] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 218 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]:[2022-03-04 04:03:21,549] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 91 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, 
lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 367 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]:Traceback (most recent 
call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,536] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 103 [default5]:[2022-03-04 04:03:21,546] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 61 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 85 [default5]:Traceback (most recent call last): [default5]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]:[2022-03-04 04:03:21,542] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 147 [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: 
self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default4]:[2022-03-04 04:03:21,539] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 100 [default6]:[2022-03-04 04:03:21,559] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 78 [default6]:[2022-03-04 04:03:21,538] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 222 [default1]:[2022-03-04 04:03:21,633] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 217 [default5]:[2022-03-04 04:03:21,544] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 221 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, 
in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 248426) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in 
load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: 
File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,642] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 238 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) 
[default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,644] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 87 [default7]:[2022-03-04 04:03:21,644] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 95 [default2]:[2022-03-04 04:03:21,708] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 234 [default3]:[2022-03-04 04:03:21,686] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 83 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,711] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 29 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 248345) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default6]:[2022-03-04 04:03:21,652] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 94 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict 
[default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: 
self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:21,640] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 289 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,696] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 291 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]:[2022-03-04 04:03:21,645] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 256 [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:[2022-03-04 04:03:21,662] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 257 [default0]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
[default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,691] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 261 [default5]:[2022-03-04 04:03:21,755] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 381 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:21,782] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 96 [default1]:[2022-03-04 04:03:21,801] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 97 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:21,766] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 59 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, 
**kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]:[2022-03-04 04:03:21,790] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 90 [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:21,791] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 86 [default3]:Traceback (most recent call last): [default3]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]:[2022-03-04 
04:03:21,789] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 146 [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad 
= current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = 
current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = 
current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:21,823] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 145 [default5]:[2022-03-04 04:03:21,778] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 149 [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:21,760] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 290 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 210935) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 257946) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:21,888] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 101 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:21,907] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 64 [default3]:[2022-03-04 04:03:21,981] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 99 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:21,970] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 151 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, 
**kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 245861) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default4]:Traceback (most recent call last): [default4]:[2022-03-04 04:03:21,970] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 132 [default4]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default0]:Traceback (most recent call last): [default0]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' [default0]:[2022-03-04 04:03:22,033] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 224 [default2]:[2022-03-04 04:03:22,108] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 226 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 255613) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default4]:Traceback (most recent call last): [default4]:[2022-03-04 04:03:22,186] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 68 [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 
410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default2]:[2022-03-04 04:03:22,220] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 66 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 228360) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:22,249] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 229 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:22,306] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 69 [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' [default7]:[2022-03-04 04:03:22,256] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 231 [default4]:[2022-03-04 04:03:22,372] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 228 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 
245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:22,370] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 67 [default4]:Traceback (most recent call last): [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default4]: main() [default4]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper 
[default4]: return f(*args, **kwargs) [default4]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default4]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default4]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default4]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default4]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default4]: success = self._load_zero_checkpoint( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default4]: self.optimizer.load_state_dict( [default4]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default4]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default4]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:22,380] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 230 [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' [default6]:[2022-03-04 04:03:22,425] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 134 [default6]:Traceback (most recent call last): [default6]: File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default3]:[2022-03-04 04:03:22,424] [INFO] 
[engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 227 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 229012) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default0]:[2022-03-04 04:03:22,474] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 128 [default0]:Traceback (most recent call last): [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default0]: main() [default0]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default0]: return f(*args, **kwargs) [default0]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default0]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default0]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default0]: success = self._load_zero_checkpoint( [default0]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default0]: self.optimizer.load_state_dict( [default0]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default0]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default0]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 231488) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default6]:Traceback (most recent call last): [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default6]: main() [default6]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default6]: return f(*args, **kwargs) [default6]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default6]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default6]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default6]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default6]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default6]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default6]: success = self._load_zero_checkpoint( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default6]: self.optimizer.load_state_dict( [default6]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default6]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default6]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:22,584] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 65 [default7]:[2022-03-04 04:03:22,576] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 71 [default6]:[2022-03-04 04:03:22,562] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 70 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 245949) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default7]:Traceback (most recent call last): [default1]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, 
**kwargs) [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]:[2022-03-04 04:03:22,545] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 225 [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 248201) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default3]:[2022-03-04 04:03:22,559] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 131 [default3]:Traceback (most recent call last): [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default3]: main() [default3]: File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default3]: return f(*args, **kwargs) [default3]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default3]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default3]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default3]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default3]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default3]: success = self._load_zero_checkpoint( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default3]: self.optimizer.load_state_dict( [default3]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default3]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default3]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 253190) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 259743) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default1]:Traceback (most recent call last): [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default1]: main() [default1]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default1]: return f(*args, **kwargs) [default1]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default1]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default1]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default1]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default1]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default1]: success = self._load_zero_checkpoint( [default1]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default1]: self.optimizer.load_state_dict( [default1]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default1]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default1]:KeyError: 'clip_grad' [default1]:[2022-03-04 04:03:22,696] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 129 ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 297995) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default5]:Traceback (most recent call last): [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default5]: main() [default7]:[2022-03-04 04:03:22,820] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 135 [default5]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default5]: return f(*args, **kwargs) [default5]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default5]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default5]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default5]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default5]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default5]: success = self._load_zero_checkpoint( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default5]: self.optimizer.load_state_dict( [default5]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default5]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default5]:KeyError: 'clip_grad' [default5]:[2022-03-04 04:03:22,772] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 133 [default7]:Traceback (most recent call last): [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default7]: main() [default7]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default7]: return f(*args, **kwargs) [default7]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default7]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default7]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default7]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default7]: File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default7]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default7]: success = self._load_zero_checkpoint( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default7]: self.optimizer.load_state_dict( [default7]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default7]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default7]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 7 (pid: 287085) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python [default2]:[2022-03-04 04:03:22,932] [INFO] [engine.py:2743:_get_all_zero_checkpoint_state_dicts] successfully read 8 ZeRO state_dicts for rank 130 [default2]:Traceback (most recent call last): [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 249, in <module> [default2]: main() [default2]: File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper [default2]: return f(*args, **kwargs) [default2]: File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main [default2]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain [default2]: model, optimizer, 
lr_scheduler = setup_model_and_optimizer(model_provider) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer [default2]: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint [default2]: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint [default2]: success = self._load_zero_checkpoint( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint [default2]: self.optimizer.load_state_dict( [default2]: File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict [default2]: self.clip_grad = current_rank_sd[CLIP_GRAD] [default2]:KeyError: 'clip_grad' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 242645) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 249073) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253346 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253347 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253349 closing signal SIGTERM WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253350 closing signal SIGTERM 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251307 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251308 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 251311 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 254537 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253463 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 221874 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 256020 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 253488 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 129885 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89534 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89535 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89537 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 89538 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 291330 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 231555 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 247801) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 90537) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 250175) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 264226) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 252588) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 233472) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 227344 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 270064) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 77365) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 247683) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 255667) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 242925) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 291331) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 227343) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 256186) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'jean-zay-iam45-ib0_246954_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousTimeoutError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'jean-zay-iam35-ib0_246581_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousTimeoutError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 129889) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 221879) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 231559) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 253467) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 254538) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 247377) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 253492) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 256022) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 251309) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 253348) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python
Fatal Python error: Segmentation fault
Current thread 0x0000145ce13c5700 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/linecache.py", line 74 in checkcache
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 783 in findsource
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1477 in getframeinfo
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1503 in getouterframes
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1526 in stack
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 42 in get_method_name
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 585 in _record
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 619 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 870 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x0000145dea52e600 (most recent call first):
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715 in run
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724 in main
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728 in <module>
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87 in _run_code
  File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194 in _run_module_as_main
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 89536) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python Fatal Python error: Segmentation fault Current thread 0x000014630de05700 (most recent call first): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/genericpath.py", line 19 in exists File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 705 in getsourcefile File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1473 in getframeinfo File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1503 in getouterframes File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1526 in stack File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 42 in get_method_name File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 585 in _record File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 667 in _keep_alive File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 645 in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 870 in run File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 932 in _bootstrap_inner File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 890 in _bootstrap Thread 0x0000146416f6e600 (most recent call first): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier Fatal Python error: Segmentation fault Current thread 0x0000151b9b765700 (most recent call first): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/genericpath.py", line 19 in exists File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 705 in getsourcefile File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1473 in getframeinfo File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1503 in getouterframes File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/inspect.py", line 1526 in stack File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 42 in get_method_name File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 585 in _record File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 667 in _keep_alive File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 645 in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 870 in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 932 in _bootstrap_inner File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/threading.py", line 890 in _bootstrap Thread 0x0000151ca48ce600 (most recent call first): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> Traceback (most 
recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in 
main exec(code, run_globals) Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main exec(code, run_globals) return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code return _run_code(code, main_globals, None, return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) exec(code, run_globals) return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> return _run_code(code, main_globals, None, exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 
728, in <module> exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main exec(code, run_globals) exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) main() exec(code, run_globals) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper main() return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper elastic_launch( main() exec(code, run_globals) main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper exec(code, run_globals) exec(code, run_globals) return f(*args, **kwargs) return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main return f(*args, **kwargs) return _run_code(code, main_globals, None, return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main exec(code, run_globals) return f(*args, **kwargs) return f(*args, **kwargs) Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main run(args) main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> raise ChildFailedError( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper run(args) main() main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam10-ib0 rank : 73 (local_rank: 1) exitcode : 1 (pid: 255668) error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, main() run(args) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run return _run_code(code, main_globals, None, run(args) main() run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam10-ib0 rank : 74 (local_rank: 2) exitcode : 1 (pid: 255669) error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) 
File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run run(args) main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run return f(*args, **kwargs) return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main main() File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam10-ib0 rank : 75 (local_rank: 3) exitcode : 1 (pid: 255670) error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) run(args) exec(code, run_globals) return f(*args, **kwargs) run(args) return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> elastic_launch( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint 
success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam10-ib0 rank : 76 (local_rank: 4) exitcode : 1 (pid: 255671) error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict return launch_agent(self._config, self._entrypoint, list(args)) return f(*args, **kwargs) return _run_code(code, main_globals, None, self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam10-ib0 rank : 77 (local_rank: 5) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exitcode : 1 (pid: 255672) 
error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main elastic_launch( return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in 
load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam10-ib0 rank : 78 (local_rank: 6) exitcode : 1 (pid: 255673) elastic_launch( elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ elastic_launch( return _run_code(code, main_globals, None, exec(code, run_globals) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam10-ib0 rank : 79 (local_rank: 7) exitcode : 1 (pid: 255674) elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ raise ChildFailedError( run(args) raise ChildFailedError( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code run(args) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/7/error.json traceback : Traceback (most recent call last): elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) main() return launch_agent(self._config, self._entrypoint, list(args)) run(args) return launch_agent(self._config, self._entrypoint, list(args)) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:13 host : 
jean-zay-iam35-ib0 rank : 277 (local_rank: 5) exitcode : 1 (pid: 246697) error_file: /tmp/torchelastic_73ef0in3/none_rjpdp1c0/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, elastic_launch( run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:22 host : jean-zay-iam29-ib0 rank : 227 (local_rank: 3) exitcode : 1 (pid: 251310) error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, 
in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 
140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> raise ChildFailedError( loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam10-ib0 rank : 72 (local_rank: 0) exitcode : 1 (pid: 255667) return launch_agent(self._config, self._entrypoint, list(args)) return launch_agent(self._config, self._entrypoint, list(args)) raise ChildFailedError( return f(*args, **kwargs) raise ChildFailedError( raise ChildFailedError( Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:13 host : jean-zay-iam35-ib0 rank : 276 (local_rank: 4) exitcode 
: 1 (pid: 246696) error_file: /tmp/torchelastic_73ef0in3/none_rjpdp1c0/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:22 host : jean-zay-iam29-ib0 rank : 229 (local_rank: 5) exitcode : 1 (pid: 251312) error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main return _run_code(code, main_globals, None, run(args) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam12-ib0 rank : 90 (local_rank: 2) exitcode : 1 (pid: 254539) error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/2/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, run(args) error_file: /tmp/torchelastic_v9wsmk1k/none_jfeihj5a/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam04-ib0 rank : 25 (local_rank: 1) exitcode : 1 (pid: 252589) error_file: 
/tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam19-ib0 rank : 146 (local_rank: 2) exitcode : 1 (pid: 227345) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:21 host : jean-zay-iam13-ib0 rank : 97 (local_rank: 1) exitcode : 1 (pid: 256021) error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/1/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ raise ChildFailedError( raise ChildFailedError( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( elastic_launch( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( raise ChildFailedError( Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main elastic_launch( return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exec(code, run_globals) main() Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main elastic_launch( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam12-ib0 rank : 91 (local_rank: 3) exitcode : 1 (pid: 254540) error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam23-ib0 rank : 177 (local_rank: 1) exitcode : 1 (pid: 90538) 
error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam24-ib0 rank : 184 (local_rank: 0) exitcode : 1 (pid: 259743) error_file: /tmp/torchelastic_iy55snta/none_9rp61zrm/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:20 host : jean-zay-iam04-ib0 rank : 26 (local_rank: 2) exitcode : 1 (pid: 252590) error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/2/error.json 
traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam19-ib0 rank : 147 (local_rank: 3) exitcode : 1 (pid: 227346) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam13-ib0 rank : 99 (local_rank: 3) exitcode : 1 (pid: 256023) error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/3/error.json traceback : 
Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main exec(code, run_globals) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam27-ib0 rank : 209 (local_rank: 1) exitcode : 1 (pid: 233473) error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:22 host : jean-zay-iam29-ib0 rank : 230 (local_rank: 6) exitcode : 1 (pid: 251313) error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint success = self._load_zero_checkpoint( return f(*args, **kwargs) elastic_launch( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( run(args) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( return launch_agent(self._config, self._entrypoint, list(args)) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 
140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) return launch_agent(self._config, self._entrypoint, list(args)) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent return f(*args, **kwargs) elastic_launch( exec(code, run_globals) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:19 host : jean-zay-iam23-ib0 rank : 178 (local_rank: 2) exitcode : 1 (pid: 90539) error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:20 host : jean-zay-iam27-ib0 rank : 210 
(local_rank: 2) exitcode : 1 (pid: 233474) error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam12-ib0 rank : 92 (local_rank: 4) exitcode : 1 (pid: 254541) error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam04-ib0 rank : 27 (local_rank: 3) exitcode : 1 (pid: 252591) error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = 
current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam19-ib0 rank : 148 (local_rank: 4) exitcode : 1 (pid: 227347) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) raise ChildFailedError( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam13-ib0 rank : 100 (local_rank: 4) exitcode : 1 (pid: 256024) error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper raise ChildFailedError( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:22 host : jean-zay-iam09-ib0 rank : 69 (local_rank: 5) exitcode : 1 (pid: 253351) error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict main() File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam03-ib0 rank : 17 (local_rank: 1) exitcode : 1 (pid: 264227) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, return launch_agent(self._config, self._entrypoint, list(args)) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam12-ib0 rank : 93 (local_rank: 5) exitcode : 1 (pid: 254542) error_file: 
/tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:19 host : jean-zay-iam23-ib0 rank : 179 (local_rank: 3) exitcode : 1 (pid: 90540) error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:19 host : jean-zay-iam04-ib0 rank : 28 (local_rank: 4) exitcode : 1 (pid: 252592) error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in 
wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam19-ib0 rank : 149 (local_rank: 5) exitcode : 1 (pid: 227348) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam13-ib0 rank : 101 (local_rank: 5) exitcode : 1 (pid: 256025) error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam27-ib0 rank : 211 (local_rank: 3) exitcode : 1 (pid: 233475) error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain 
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:21 host : jean-zay-iam37-ib0 rank : 289 (local_rank: 1) exitcode : 1 (pid: 231556) error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main raise ChildFailedError( File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:21 host : jean-zay-iam28-ib0 rank : 217 (local_rank: 1) exitcode : 1 (pid: 129886) error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:22 host : jean-zay-iam09-ib0 rank : 70 (local_rank: 6) exitcode : 1 (pid: 253352) error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in 
setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:20 host : jean-zay-iam03-ib0 rank : 18 (local_rank: 2) exitcode : 1 (pid: 264228) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:18 host : jean-zay-iam26-ib0 rank : 201 (local_rank: 1) exitcode : 1 (pid: 245862) 
error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam12-ib0 rank : 94 (local_rank: 6) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:19 host : jean-zay-iam23-ib0 rank : 180 (local_rank: 4) exitcode : 1 (pid: 90541) error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/4/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam04-ib0 rank : 29 (local_rank: 5) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam19-ib0 rank : 150 (local_rank: 6) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam13-ib0 rank : 102 (local_rank: 6) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 
2022-03-04_04:03:21 host : jean-zay-iam27-ib0 rank : 212 (local_rank: 4) exitcode : 1 (pid: 233476) error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam37-ib0 rank : 290 (local_rank: 2) exitcode : 1 (pid: 231557) error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam28-ib0 rank : 218 (local_rank: 2) exitcode : 1 (pid: 129887) error_file: 
/tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main exitcode : 1 (pid: 254543) error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) run(args) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exitcode : 1 (pid: 252593) error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) elastic_launch( exitcode : 1 (pid: 227349) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exitcode : 1 (pid: 256026) error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:18 host : jean-zay-iam26-ib0 rank : 203 (local_rank: 3) exitcode : 1 (pid: 245864) error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main run(args) raise ChildFailedError( main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, 
state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam12-ib0 rank : 95 (local_rank: 7) exitcode : 1 (pid: 254544) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam23-ib0 rank : 181 (local_rank: 5) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam04-ib0 rank : 30 (local_rank: 6) exitcode : 1 (pid: 252594) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam19-ib0 rank : 151 (local_rank: 7) exitcode : 1 (pid: 227350) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:22 host : jean-zay-iam09-ib0 rank : 71 (local_rank: 7) exitcode : 1 (pid: 253353) error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam13-ib0 rank : 103 (local_rank: 7) exitcode : 1 (pid: 256027) return f(*args, **kwargs) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam27-ib0 rank : 213 (local_rank: 5) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam03-ib0 rank : 19 (local_rank: 3) exitcode : 1 (pid: 264229) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exitcode : 1 (pid: 90542) error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/5/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main exitcode : 1 (pid: 233477) error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam37-ib0 rank : 291 (local_rank: 3) exitcode : 1 (pid: 231558) error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return 
f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam28-ib0 rank : 219 (local_rank: 3) exitcode : 1 (pid: 129888) error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/3/error.json traceback : Traceback (most recent call last): 
File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam23-ib0 rank : 182 (local_rank: 6) exitcode : 1 (pid: 90543) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam04-ib0 rank : 31 (local_rank: 7) exitcode : 1 (pid: 252595) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:19 self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:22 host : jean-zay-iam09-ib0 rank : 66 (local_rank: 2) exitcode : 1 (pid: 253348) error_file: /tmp/torchelastic_jxwxxlos/none_0y5r9d70/attempt_0/2/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [1]: time : 2022-03-04_04:03:20 File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in 
_load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:19 host : jean-zay-iam27-ib0 rank : 214 (local_rank: 6) exitcode : 1 (pid: 233478) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:20 host : jean-zay-iam03-ib0 rank : 20 (local_rank: 4) exitcode : 1 (pid: 264230) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, 
forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:18 host : jean-zay-iam26-ib0 rank : 204 (local_rank: 4) exitcode : 1 (pid: 245865) error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent host : jean-zay-iam19-ib0 rank : 144 (local_rank: 0) exitcode : 1 (pid: 227343) error_file: /tmp/torchelastic_k33r8okq/none_pghuqt6x/attempt_0/0/error.json model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( host : jean-zay-iam13-ib0 rank : 98 (local_rank: 2) exitcode : 1 (pid: 256022) error_file: /tmp/torchelastic_nmpml80r/none_xip4wi3r/attempt_0/2/error.json error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam37-ib0 rank : 293 (local_rank: 5) exitcode : 1 (pid: 231560) error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, 
in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:19 host : jean-zay-iam30-ib0 rank : 232 (local_rank: 0) exitcode : 1 (pid: 250171) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam28-ib0 rank : 221 (local_rank: 5) exitcode : 1 (pid: 129890) error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam23-ib0 rank : 183 (local_rank: 7) exitcode : 1 (pid: 90544) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:21 host : jean-zay-iam27-ib0 rank : 215 (local_rank: 7) exitcode : 1 (pid: 233479) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam03-ib0 rank : 21 (local_rank: 5) elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam26-ib0 rank : 205 (local_rank: 5) exitcode : 1 (pid: 245866) error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/7/error.json traceback : Traceback (most recent call last): loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' 
------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:19 host : jean-zay-iam04-ib0 rank : 24 (local_rank: 0) exitcode : 1 (pid: 252588) loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ ============================================================ loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam37-ib0 rank : 294 (local_rank: 6) run(args) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam30-ib0 rank : 233 (local_rank: 1) exitcode : 1 (pid: 250172) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main return f(*args, **kwargs) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 
2022-03-04_04:03:21 host : jean-zay-iam28-ib0 rank : 222 (local_rank: 6) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint error_file: /tmp/torchelastic_xs2z2rk2/none_asx79gi8/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) raise ChildFailedError( exitcode : 1 (pid: 231561) error_file: 
/tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main exitcode : 1 (pid: 129891) error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed 
failure): [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam23-ib0 rank : 176 (local_rank: 0) exitcode : 1 (pid: 90537) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam37-ib0 rank : 295 (local_rank: 7) exitcode : 1 (pid: 231562) time : 2022-03-04_04:03:16 host : jean-zay-iam26-ib0 rank 
: 200 (local_rank: 0) exitcode : 1 (pid: 245861) error_file: /tmp/torchelastic_btjb0pc4/none_te9ioczi/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = 
current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam28-ib0 rank : 223 (local_rank: 7) exitcode : 1 (pid: 129892) error_file: /tmp/torchelastic_xtaya7c1/none_0jl3d_bm/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) args.iteration = load_checkpoint(model, 
optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam30-ib0 rank : 234 (local_rank: 2) exitcode : 1 (pid: 250173) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/7/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) elastic_launch( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:21 host : 
jean-zay-iam34-ib0 rank : 265 (local_rank: 1) exitcode : 1 (pid: 247684) error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, main() File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [3]: time : 2022-03-04_04:03:20 return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main elastic_launch( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in 
setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [3]: time : 2022-03-04_04:03:20 File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper host : jean-zay-iam37-ib0 rank : 292 (local_rank: 4) exitcode : 1 (pid: 231559) error_file: /tmp/torchelastic_palp8kdb/none_ty7ox4jd/attempt_0/4/error.json File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam30-ib0 rank : 235 (local_rank: 3) exitcode : 1 (pid: 250174) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", 
line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) host : jean-zay-iam28-ib0 rank : 220 (local_rank: 4) exitcode : 1 (pid: 129889) error_file: /tmp/torchelastic__j_8gndg/none_82iifp9v/attempt_0/4/error.json File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam34-ib0 rank : 266 (local_rank: 2) exitcode : 1 (pid: 247685) error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam30-ib0 rank : 237 (local_rank: 5) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exitcode : 1 (pid: 250176) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/5/error.json traceback : Traceback (most recent call last): 
File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam34-ib0 rank : 267 (local_rank: 3) exitcode : 1 (pid: 247686) error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) raise ChildFailedError( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in 
load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict elastic_launch( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:19 host : jean-zay-iam34-ib0 rank : 268 (local_rank: 4) exitcode : 1 (pid: 247687) error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ run(args) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in 
load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam34-ib0 rank : 269 (local_rank: 5) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:21 host : jean-zay-iam33-ib0 rank : 256 (local_rank: 0) exitcode : 1 (pid: 247373) error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, exitcode : 1 (pid: 247688) error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/5/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return f(*args, **kwargs) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( exec(code, run_globals) return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam34-ib0 rank : 270 (local_rank: 6) exitcode : 1 (pid: 247689) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam33-ib0 rank : 257 (local_rank: 1) exitcode : 1 (pid: 247374) error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> raise ChildFailedError( File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) elastic_launch( pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( 
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam02-ib0 rank : 9 (local_rank: 1) exitcode : 1 (pid: 256187) error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam34-ib0 rank : 271 (local_rank: 7) exitcode : 1 (pid: 247690) return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/7/error.json traceback : Traceback (most recent call last): self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam33-ib0 rank : 258 (local_rank: 2) exitcode : 1 (pid: 247375) error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 
245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:20 host : jean-zay-iam02-ib0 rank : 10 (local_rank: 2) exitcode : 1 (pid: 256188) error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main raise ChildFailedError( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, 
lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam34-ib0 rank : 264 (local_rank: 0) exitcode : 1 (pid: 247683) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam33-ib0 rank : 259 (local_rank: 3) exitcode : 1 (pid: 247376) error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_5j0qufv2/none_3bo5ywq4/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in 
load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam02-ib0 rank : 11 (local_rank: 3) exitcode : 1 (pid: 256189) error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam33-ib0 rank : 261 (local_rank: 5) elastic_launch( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exec(code, run_globals) exitcode : 1 (pid: 247378) error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:20 host : jean-zay-iam02-ib0 rank : 12 (local_rank: 4) exitcode : 1 (pid: 256190) error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:21 host : jean-zay-iam11-ib0 rank : 81 (local_rank: 1) exitcode : 1 (pid: 253489) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in 
<module> File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam33-ib0 rank : 262 (local_rank: 6) exitcode : 1 (pid: 247379) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) main() self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam02-ib0 rank : 13 (local_rank: 5) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", 
line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam11-ib0 rank : 82 (local_rank: 2) exitcode : 1 (pid: 253490) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam33-ib0 rank : 263 (local_rank: 7) exitcode : 1 (pid: 247380) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper exitcode : 1 (pid: 256191) error_file: 
/tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/7/error.json traceback : Traceback (most recent call last): File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam02-ib0 rank : 14 (local_rank: 6) exitcode : 1 (pid: 256192) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam11-ib0 rank : 83 (local_rank: 3) exitcode : 1 (pid: 253491) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [4]: time : 2022-03-04_04:03:20 host : jean-zay-iam33-ib0 rank : 260 (local_rank: 4) exitcode : 1 (pid: 247377) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:21 host : jean-zay-iam02-ib0 rank : 15 (local_rank: 7) exitcode : 1 (pid: 256193) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_69k__5yq/none_1shakq5n/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/7/error.json traceback : Traceback (most recent call last): self.clip_grad = 
current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam11-ib0 rank : 85 (local_rank: 5) exitcode : 1 (pid: 253493) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict return f(*args, **kwargs) loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( 
File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:20 host : jean-zay-iam02-ib0 rank : 8 (local_rank: 0) exitcode : 1 (pid: 256186) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam11-ib0 rank : 86 (local_rank: 6) run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main error_file: /tmp/torchelastic_krr_ajm9/none_e2n52mvu/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) exitcode : 1 (pid: 253494) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/6/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam11-ib0 rank : 87 (local_rank: 7) exitcode : 1 (pid: 253495) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 
2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam11-ib0 rank : 84 (local_rank: 4) exitcode : 1 (pid: 253492) error_file: /tmp/torchelastic_a345yzbo/none_i4v615n3/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ elastic_launch( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam46-ib0 rank : 361 (local_rank: 1) exitcode : 1 (pid: 247802) error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) 
File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam46-ib0 rank : 362 (local_rank: 2) exitcode : 1 (pid: 247803) 
error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main return launch_agent(self._config, self._entrypoint, list(args)) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( return launch_agent(self._config, self._entrypoint, list(args)) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam46-ib0 rank : 363 (local_rank: 3) exitcode : 1 (pid: 247804) error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in 
_load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict raise ChildFailedError( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:20 host : jean-zay-iam46-ib0 rank : 364 (local_rank: 4) exitcode : 1 (pid: 247805) error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:20 host : jean-zay-iam15-ib0 rank : 113 (local_rank: 1) exitcode : 1 (pid: 221875) error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict raise ChildFailedError( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( raise ChildFailedError( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam46-ib0 rank : 365 (local_rank: 5) 
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:22 host : jean-zay-iam17-ib0 rank : 133 (local_rank: 5) exitcode : 1 (pid: 89539) error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam15-ib0 rank : 114 (local_rank: 2) exitcode : 1 (pid: 221876) error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main exitcode : 1 (pid: 247806) error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/5/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) elastic_launch( return launch_agent(self._config, self._entrypoint, list(args)) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:18 host : jean-zay-iam42-ib0 rank : 329 (local_rank: 1) exitcode : 1 (pid: 254450) error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:21 host : jean-zay-iam08-ib0 rank : 57 (local_rank: 1) exitcode : 1 (pid: 253464) error_file: 
/tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam46-ib0 rank : 366 (local_rank: 6) exitcode : 1 (pid: 247807) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:22 host : jean-zay-iam17-ib0 rank : 134 (local_rank: 6) exitcode : 1 (pid: 89540) error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 
334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:17 host : jean-zay-iam42-ib0 rank : 331 (local_rank: 3) exitcode : 1 (pid: 254452) error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam15-ib0 rank : 115 (local_rank: 3) exitcode : 1 (pid: 
221877) error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam08-ib0 rank : 58 (local_rank: 2) exitcode : 1 (pid: 253465) error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in 
load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:21 host : jean-zay-iam46-ib0 rank : 367 (local_rank: 7) exitcode : 1 (pid: 247808) raise ChildFailedError( pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/7/error.json traceback : Traceback (most recent call last): return launch_agent(self._config, self._entrypoint, list(args)) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:22 host : jean-zay-iam17-ib0 rank : 135 (local_rank: 7) exitcode : 1 (pid: 89541) error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam15-ib0 rank : 116 (local_rank: 4) exitcode : 1 (pid: 221878) error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, 
optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict raise ChildFailedError( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:18 host : jean-zay-iam42-ib0 rank : 332 (local_rank: 4) exitcode : 1 (pid: 254453) error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam08-ib0 rank : 59 (local_rank: 3) exitcode : 1 (pid: 253466) error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 
2022-03-04_04:03:19 host : jean-zay-iam46-ib0 rank : 360 (local_rank: 0) exitcode : 1 (pid: 247801) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exec(code, run_globals) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:22 host : jean-zay-iam17-ib0 rank : 130 (local_rank: 2) exitcode : 1 (pid: 89536) error_file: /tmp/torchelastic_9wudqz0i/none_6bors0rf/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain self.clip_grad = 
current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam15-ib0 rank : 118 (local_rank: 6) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_e6lr8l3p/none_48_6jkre/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) 
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam14-ib0 rank : 105 (local_rank: 1) exitcode : 1 (pid: 270065) error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam42-ib0 rank : 334 (local_rank: 6) exitcode : 1 (pid: 254455) error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( exitcode : 1 (pid: 221880) error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:21 host : jean-zay-iam08-ib0 rank : 61 (local_rank: 5) exitcode : 1 (pid: 253468) error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/5/error.json traceback : 
Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:19 host : jean-zay-iam15-ib0 rank : 119 (local_rank: 7) exitcode : 1 (pid: 221881) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in 
load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:19 host : jean-zay-iam07-ib0 rank : 50 (local_rank: 2) exitcode : 1 (pid: 291332) error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:18 host : jean-zay-iam42-ib0 rank : 335 (local_rank: 7) ============================================================ error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam08-ib0 rank : 62 (local_rank: 6) raise ChildFailedError( exitcode : 1 (pid: 254456) error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 
345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [4]: time : 2022-03-04_04:03:19 exitcode : 1 (pid: 253469) error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:20 host : jean-zay-iam14-ib0 rank : 106 (local_rank: 2) exitcode : 1 (pid: 270066) error_file: 
/tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:15 host : jean-zay-iam15-ib0 rank : 117 (local_rank: 5) exitcode : 1 (pid: 221879) error_file: /tmp/torchelastic_89te1n3g/none_t6tchlnq/attempt_0/5/error.json File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam08-ib0 rank : 63 (local_rank: 7) exitcode : 1 (pid: 253470) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam07-ib0 rank : 51 (local_rank: 3) exitcode : 1 (pid: 291333) error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( host : jean-zay-iam42-ib0 rank : 328 (local_rank: 0) exitcode : 1 (pid: 254449) error_file: /tmp/torchelastic_a1kwb2ql/none_vcory2iv/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 
2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", 
line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [3]: time : 2022-03-04_04:03:20 pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam14-ib0 rank : 107 (local_rank: 3) exitcode : 1 (pid: 270067) error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 
host : jean-zay-iam08-ib0 rank : 60 (local_rank: 4) exitcode : 1 (pid: 253467) error_file: /tmp/torchelastic_jfxug801/none_keh1rze3/attempt_0/4/error.json File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:20 host : jean-zay-iam38-ib0 rank : 297 (local_rank: 1) exitcode : 1 (pid: 77366) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam07-ib0 rank : 52 (local_rank: 4) exitcode : 1 (pid: 291334) error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam14-ib0 rank : 108 (local_rank: 4) exitcode : 1 (pid: 270068) error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, 
optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:19 host : jean-zay-iam07-ib0 rank : 53 (local_rank: 5) exitcode : 1 (pid: 291335) error_file: 
/tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam14-ib0 rank : 109 (local_rank: 5) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in 
load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exitcode : 1 (pid: 270069) error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam07-ib0 rank : 54 (local_rank: 6) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:20 host : jean-zay-iam38-ib0 rank : 298 (local_rank: 2) exitcode : 1 (pid: 77367) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam14-ib0 rank : 110 (local_rank: 6) exitcode : 1 (pid: 270070) exitcode : 1 (pid: 291336) error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, 
**kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam07-ib0 rank : 55 (local_rank: 7) exitcode : 1 (pid: 291337) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint 
self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam14-ib0 rank : 111 (local_rank: 7) exitcode : 1 (pid: 270071) error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:19 self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:20 host : jean-zay-iam38-ib0 rank : 299 (local_rank: 3) exitcode : 1 (pid: 77368) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint host : jean-zay-iam07-ib0 rank : 49 (local_rank: 1) exitcode : 1 (pid: 291331) error_file: /tmp/torchelastic_d22owlx8/none_lbnm0_b9/attempt_0/1/error.json File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam14-ib0 rank : 104 (local_rank: 0) exitcode : 1 (pid: 270064) main() traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam38-ib0 rank : 300 (local_rank: 4) exitcode : 1 (pid: 77369) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/4/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) error_file: /tmp/torchelastic_hk2d8vdk/none_cdyyedur/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint 
self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:20 host : jean-zay-iam38-ib0 rank : 301 (local_rank: 5) exitcode : 1 (pid: 77370) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main elastic_launch( File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code return launch_agent(self._config, self._entrypoint, list(args)) exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) raise ChildFailedError( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:21 host : jean-zay-iam48-ib0 rank : 377 (local_rank: 1) exitcode : 1 (pid: 242926) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/1/error.json traceback : Traceback 
(most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:21 host : jean-zay-iam48-ib0 rank : 378 (local_rank: 2) exitcode : 1 (pid: 242927) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/2/error.json traceback : Traceback (most recent call last): 
File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict exec(code, run_globals) self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:21 host : jean-zay-iam48-ib0 rank : 379 (local_rank: 3) exitcode : 1 (pid: 242928) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/3/error.json traceback : Traceback (most recent call last): 
File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:20 host : jean-zay-iam48-ib0 rank : 380 (local_rank: 4) exitcode : 1 (pid: 242929) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/4/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code main() return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main return f(*args, **kwargs) exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 
245, in launch_agent elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ raise ChildFailedError( raise ChildFailedError( raise ChildFailedError( return launch_agent(self._config, self._entrypoint, list(args)) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:17 host : jean-zay-iam16-ib0 rank : 121 (local_rank: 1) exitcode : 1 (pid: 257947) error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam31-ib0 rank : 244 (local_rank: 4) exitcode : 1 (pid: 249073) error_file: /tmp/torchelastic_akv0smqd/none_n2qqul8m/attempt_0/4/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:17 host : jean-zay-iam20-ib0 rank : 156 (local_rank: 4) exitcode : 1 (pid: 229016) error_file: /tmp/torchelastic_ynh8uw7t/none_o9onshx8/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:17 host : jean-zay-iam16-ib0 rank : 122 (local_rank: 2) exitcode : 1 (pid: 257948) error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam20-ib0 rank : 152 (local_rank: 0) exitcode : 1 (pid: 229012) error_file: 
/tmp/torchelastic_ynh8uw7t/none_o9onshx8/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint raise ChildFailedError( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:17 host : jean-zay-iam16-ib0 rank : 124 (local_rank: 4) exitcode : 1 (pid: 257950) error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:14 host : jean-zay-iam45-ib0 rank : 356 (local_rank: 4) exitcode : 1 (pid: 247069) error_file: /tmp/torchelastic_u3pn7nlm/none_93danifp/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:17 host : jean-zay-iam16-ib0 rank : 125 (local_rank: 5) exitcode : 1 (pid: 257951) error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", 
line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:18 host : jean-zay-iam16-ib0 rank : 127 (local_rank: 7) exitcode : 1 (pid: 257953) error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:16 host : jean-zay-iam16-ib0 rank : 120 (local_rank: 0) exitcode : 1 (pid: 257946) error_file: /tmp/torchelastic_br2jb02z/none_2toiqfgx/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam25-ib0 rank : 194 (local_rank: 2) exitcode : 1 (pid: 245947) error_file: /tmp/torchelastic_zgly6wyk/none_ur58uh_8/attempt_0/2/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [1]: time : 2022-03-04_04:03:17 host : jean-zay-iam25-ib0 rank : 196 (local_rank: 4) exitcode : 1 (pid: 245949) error_file: /tmp/torchelastic_zgly6wyk/none_ur58uh_8/attempt_0/4/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam36-ib0 rank : 285 (local_rank: 5) exitcode : 1 (pid: 248201) error_file: /tmp/torchelastic_e_0ppd2k/none_85qwmbg_/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam21-ib0 rank : 164 (local_rank: 4) exitcode : 1 (pid: 231488) error_file: /tmp/torchelastic_o3ff8z3d/none_yg9aink9/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = 
load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, 
self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam06-ib0 rank : 47 (local_rank: 7) exitcode : 1 (pid: 287085) error_file: /tmp/torchelastic_mui43ycr/none_de5dz9l5/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED 
------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:18 host : jean-zay-iam22-ib0 rank : 171 (local_rank: 3) exitcode : 1 (pid: 210936) error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:18 host : jean-zay-iam22-ib0 rank : 172 (local_rank: 4) exitcode : 
1 (pid: 210937) error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:17 host : jean-zay-iam22-ib0 rank : 173 (local_rank: 5) exitcode : 1 (pid: 210938) error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/5/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam22-ib0 rank : 174 (local_rank: 6) exitcode : 1 (pid: 210939) error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return 
f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:18 host : jean-zay-iam22-ib0 rank : 175 (local_rank: 7) exitcode : 1 (pid: 210940) error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, 
model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam22-ib0 rank : 170 (local_rank: 2) exitcode : 1 (pid: 210935) error_file: /tmp/torchelastic_nkaknkmb/none_y4txzscw/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam32-ib0 rank : 248 (local_rank: 0) exitcode : 1 (pid: 250776) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:17 host : jean-zay-iam32-ib0 rank : 249 (local_rank: 1) exitcode : 1 (pid: 250777) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:17 host : jean-zay-iam32-ib0 rank : 250 (local_rank: 2) exitcode : 1 (pid: 250778) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:18 host : jean-zay-iam32-ib0 rank : 251 (local_rank: 3) exitcode : 1 (pid: 250779) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:17 host : jean-zay-iam32-ib0 rank : 253 (local_rank: 5) exitcode : 1 (pid: 250781) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam39-ib0 rank : 305 (local_rank: 1) exitcode : 1 (pid: 228357) error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", 
line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:17 host : jean-zay-iam39-ib0 rank : 306 (local_rank: 2) exitcode : 1 (pid: 228358) error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:18 host : jean-zay-iam39-ib0 rank : 310 (local_rank: 6) exitcode : 1 (pid: 228362) error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = 
model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [2]: time : 2022-03-04_04:03:16 host : jean-zay-iam39-ib0 rank : 308 (local_rank: 4) exitcode : 1 (pid: 228360) error_file: /tmp/torchelastic_vcfl8_ed/none_1ofckoam/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code 
exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam01-ib0 rank : 0 (local_rank: 0) exitcode : 1 (pid: 297991) error_file: /tmp/torchelastic_u0xq61is/none_jbeh2bpz/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [1]: time : 2022-03-04_04:03:16 host : jean-zay-iam01-ib0 rank : 4 (local_rank: 4) exitcode : 1 (pid: 297995) error_file: /tmp/torchelastic_u0xq61is/none_jbeh2bpz/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [0]: time : 2022-03-04_04:03:17 host : jean-zay-iam44-ib0 rank : 344 (local_rank: 0) exitcode : 1 (pid: 248341) error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/0/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [1]: time : 2022-03-04_04:03:18 host : jean-zay-iam44-ib0 rank : 345 (local_rank: 1) exitcode : 1 (pid: 248342) error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return 
f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:18 host : jean-zay-iam44-ib0 rank : 347 (local_rank: 3) exitcode : 1 (pid: 248344) error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, 
model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam44-ib0 rank : 349 (local_rank: 5) exitcode : 1 (pid: 248346) error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, 
optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:17 host : jean-zay-iam44-ib0 rank : 351 (local_rank: 7) exitcode : 1 (pid: 248348) error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [3]: time : 2022-03-04_04:03:16 host : jean-zay-iam44-ib0 rank : 348 (local_rank: 4) exitcode : 1 (pid: 248345) error_file: /tmp/torchelastic_bdih5jjg/none_fc2_c36k/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:18 host : jean-zay-iam43-ib0 rank : 338 (local_rank: 2) exitcode : 1 (pid: 248428) error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:17 host : jean-zay-iam43-ib0 rank : 339 (local_rank: 3) exitcode : 1 (pid: 248429) error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:15 host : jean-zay-iam43-ib0 rank : 340 (local_rank: 4) exitcode : 1 (pid: 248430) error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in 
_load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:19 host : jean-zay-iam43-ib0 rank : 342 (local_rank: 6) exitcode : 1 (pid: 248433) error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:17 host : jean-zay-iam43-ib0 rank : 343 (local_rank: 7) exitcode : 1 (pid: 248434) error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = 
current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:15 host : jean-zay-iam43-ib0 rank : 336 (local_rank: 0) exitcode : 1 (pid: 248426) error_file: /tmp/torchelastic_lggibx3o/none_9wg_uheu/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' 
============================================================ Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2022-03-04_04:03:18 host : jean-zay-iam40-ib0 rank : 313 (local_rank: 1) exitcode : 1 (pid: 108416) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/1/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper 
return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [2]: time : 2022-03-04_04:03:16 host : jean-zay-iam40-ib0 rank : 314 (local_rank: 2) exitcode : 1 (pid: 108417) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/2/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main 
pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [3]: time : 2022-03-04_04:03:17 host : jean-zay-iam40-ib0 rank : 315 (local_rank: 3) exitcode : 1 (pid: 108418) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/3/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/runpy.py", line 87, in _run_code error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, 
load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:22 host : jean-zay-iam29-ib0 rank : 231 (local_rank: 7) exitcode : 1 (pid: 251314) error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:21 loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:19 host : jean-zay-iam27-ib0 rank : 208 (local_rank: 0) exitcode : 1 (pid: 233472) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict host : jean-zay-iam12-ib0 rank : 89 (local_rank: 1) exitcode : 1 (pid: 254538) error_file: /tmp/torchelastic_rf_i17w9/none_hrcjsyal/attempt_0/1/error.json error_file: /tmp/torchelastic_solv9rst/none_3t4blzzr/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) exec(code, run_globals) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module> self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ time : 2022-03-04_04:03:22 host : jean-zay-iam29-ib0 rank : 226 (local_rank: 2) exitcode : 1 (pid: 251309) error_file: /tmp/torchelastic_pbe9bxkf/none_i9gj4lo6/attempt_0/2/error.json traceback : Traceback 
(most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ main() File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main run(args) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run elastic_launch( File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__ exitcode : 1 (pid: 264231) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam30-ib0 rank : 238 (local_rank: 6) exitcode : 1 (pid: 250177) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:21 host : jean-zay-iam03-ib0 rank : 22 (local_rank: 6) exitcode : 1 (pid: 264232) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/6/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:20 host : jean-zay-iam30-ib0 rank : 239 (local_rank: 7) exitcode : 1 (pid: 250178) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:21 host : jean-zay-iam03-ib0 rank : 23 (local_rank: 7) exitcode : 1 (pid: 264233) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/7/error.json traceback : Traceback (most recent call last): error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/7/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [4]: time : 2022-03-04_04:03:18 host : jean-zay-iam30-ib0 rank : 236 (local_rank: 4) exitcode : 1 (pid: 250175) loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:19 host : jean-zay-iam03-ib0 rank : 16 (local_rank: 0) exitcode : 1 (pid: 264226) error_file: /tmp/torchelastic_iwh6pg11/none_abpr5ybr/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File 
"/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) error_file: /tmp/torchelastic_2m77mxf3/none_u1r9yr9g/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ return launch_agent(self._config, self._entrypoint, list(args)) File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:17 
host : jean-zay-iam47-ib0 rank : 372 (local_rank: 4) exitcode : 1 (pid: 242645) error_file: /tmp/torchelastic_z75m30zs/none_ouaynzs8/attempt_0/4/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint 
loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:19 host : jean-zay-iam38-ib0 rank : 302 (local_rank: 6) exitcode : 1 (pid: 77371) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:19 host : jean-zay-iam38-ib0 rank : 303 (local_rank: 7) exitcode : 1 (pid: 77372) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = 
self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:18 host : jean-zay-iam38-ib0 rank : 296 (local_rank: 0) exitcode : 1 (pid: 77365) error_file: /tmp/torchelastic_xf1hr23i/none_72szcbet/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:21 host : jean-zay-iam48-ib0 rank : 381 (local_rank: 5) exitcode : 1 (pid: 242930) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/5/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 
245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:20 host : jean-zay-iam48-ib0 rank : 382 (local_rank: 6) exitcode : 1 (pid: 242931) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:21 host : jean-zay-iam48-ib0 rank : 383 (local_rank: 7) exitcode : 1 (pid: 242932) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = 
setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:19 host : jean-zay-iam48-ib0 rank : 376 (local_rank: 0) exitcode : 1 (pid: 242925) error_file: /tmp/torchelastic_1v2_78zf/none_zkdpalbq/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:19 host : jean-zay-iam32-ib0 rank : 254 (local_rank: 6) 
exitcode : 1 (pid: 250782) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [7]: time : 2022-03-04_04:03:18 host : jean-zay-iam32-ib0 rank : 255 (local_rank: 7) exitcode : 1 (pid: 250783) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/7/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [4]: time : 2022-03-04_04:03:16 host : jean-zay-iam32-ib0 rank : 252 (local_rank: 4) exitcode : 1 (pid: 250780) error_file: /tmp/torchelastic_8jvvtzcb/none_rohg273m/attempt_0/4/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [4]: time : 2022-03-04_04:03:19 host : jean-zay-iam40-ib0 rank : 317 (local_rank: 5) exitcode : 1 (pid: 108420) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/5/error.json traceback : Traceback (most recent call last): File 
"/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [5]: time : 2022-03-04_04:03:19 host : jean-zay-iam40-ib0 rank : 318 (local_rank: 6) exitcode : 1 (pid: 108421) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/6/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return 
f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' [6]: time : 2022-03-04_04:03:18 host : jean-zay-iam40-ib0 rank : 319 (local_rank: 7) exitcode : 1 (pid: 108422) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/7/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, 
model_provider, forward_step, File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2022-03-04_04:03:16 host : jean-zay-iam40-ib0 rank : 312 (local_rank: 0) exitcode : 1 (pid: 108415) error_file: /tmp/torchelastic_l77819s_/none_9q9l_afh/attempt_0/0/error.json traceback : Traceback (most recent call last): File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper return f(*args, **kwargs) File "/gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/pretrain_gpt.py", line 245, in main pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File 
"/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 140, in pretrain model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/training.py", line 410, in setup_model_and_optimizer args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/checkpointing.py", line 276, in load_checkpoint loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2513, in load_checkpoint success = self._load_zero_checkpoint( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/engine.py", line 2675, in _load_zero_checkpoint self.optimizer.load_state_dict( File "/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed/runtime/bf16_optimizer.py", line 334, in load_state_dict self.clip_grad = current_rank_sd[CLIP_GRAD] KeyError: 'clip_grad' ============================================================ srun: error: jean-zay-iam17: task 16: Exited with exit code 1 srun: Terminating job step 202322.0 srun: error: jean-zay-iam29: task 28: Exited with exit code 1 srun: error: jean-zay-iam46: task 45: Exited with exit code 1 srun: error: jean-zay-iam25: task 24: Exited with exit code 1 slurmstepd: error: *** STEP 202322.0 ON jean-zay-iam01 CANCELLED AT 2022-03-04T04:03:27 *** srun: error: jean-zay-iam09: task 8: Exited with exit code 1 srun: error: jean-zay-iam42: task 41: Exited with exit code 1 srun: error: jean-zay-iam37: task 36: Exited with exit code 1 srun: error: jean-zay-iam21: task 20: Exited with exit code 1 srun: error: jean-zay-iam31: task 30: Exited with exit code 1 srun: error: 
jean-zay-iam22: task 21: Exited with exit code 1 srun: error: jean-zay-iam19: task 18: Exited with exit code 1 srun: error: jean-zay-iam15: task 14: Exited with exit code 1 srun: error: jean-zay-iam35: task 34: Exited with exit code 1 srun: error: jean-zay-iam03: task 2: Exited with exit code 1 srun: error: jean-zay-iam12: task 11: Exited with exit code 1 srun: error: jean-zay-iam45: task 44: Exited with exit code 1 srun: error: jean-zay-iam10: task 9: Exited with exit code 1 srun: error: jean-zay-iam20: task 19: Exited with exit code 1 srun: error: jean-zay-iam33: task 32: Exited with exit code 1 srun: error: jean-zay-iam27: task 26: Exited with exit code 1 srun: error: jean-zay-iam02: task 1: Exited with exit code 1 srun: error: jean-zay-iam14: task 13: Exited with exit code 1 srun: error: jean-zay-iam28: task 27: Exited with exit code 1 srun: error: jean-zay-iam11: task 10: Exited with exit code 1 srun: error: jean-zay-iam34: task 33: Exited with exit code 1 srun: error: jean-zay-iam38: task 37: Exited with exit code 1 srun: error: jean-zay-iam24: task 23: Exited with exit code 1 srun: error: jean-zay-iam48: task 47: Exited with exit code 1 srun: error: jean-zay-iam36: task 35: Exited with exit code 1 srun: error: jean-zay-iam32: task 31: Exited with exit code 1 srun: error: jean-zay-iam04: task 3: Exited with exit code 1 srun: error: jean-zay-iam30: task 29: Exited with exit code 1 srun: error: jean-zay-iam39: task 38: Exited with exit code 1 srun: error: jean-zay-iam08: task 7: Exited with exit code 1 srun: error: jean-zay-iam23: task 22: Exited with exit code 1 srun: error: jean-zay-iam16: task 15: Exited with exit code 1 srun: error: jean-zay-iam44: task 43: Exited with exit code 1 srun: error: jean-zay-iam26: task 25: Exited with exit code 1 srun: error: jean-zay-iam13: task 12: Exited with exit code 1 srun: error: jean-zay-iam06: task 5: Exited with exit code 1 srun: error: jean-zay-iam40: task 39: Exited with exit code 1 srun: error: jean-zay-iam07: task 
6: Exited with exit code 1 srun: error: jean-zay-iam43: task 42: Exited with exit code 1 srun: error: jean-zay-iam01: task 0: Exited with exit code 1 srun: error: jean-zay-iam47: task 46: Exited with exit code 1 File "/gpfswork/rech/six/commun/conda/py38-pt111/lib/pyth File "/gpfswork/r srun: error: jean-zay-iam18: task 17: Segmentation fault (core dumped) srun: error: jean-zay-iam05: task 4: Segmentation fault (core dumped) srun: error: jean-zay-iam41: task 40: Segmentation fault (core dumped) WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** [default0]:using world size: 384, data-parallel-size: 8, tensor-model-parallel size: 4, pipeline-model-parallel size: 12 [default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:PretrainedFromHF [default0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type. [default0]:using torch.bfloat16 for parameters ... [default0]:------------------------ arguments ------------------------ [default0]: abort_on_unmet_fused_kernel_constraints ......... True [default0]: accumulate_allreduce_grads_in_fp32 .............. True [default0]: adam_beta1 ...................................... 0.9 [default0]: adam_beta2 ...................................... 0.95 [default0]: adam_eps ........................................ 1e-08 [default0]: adlr_autoresume ................................. False [default0]: adlr_autoresume_interval ........................ 1000 [default0]: apply_query_key_layer_scaling ................... True [default0]: apply_residual_connection_post_layernorm ........ False [default0]: attention_dropout ............................... 0.1 [default0]: attention_softmax_in_fp32 ....................... False [default0]: bert_binary_head ................................ True [default0]: bert_load ....................................... 
None [default0]: bf16 ............................................ True [default0]: bias_dropout_fusion ............................. True [default0]: bias_gelu_fusion ................................ True [default0]: biencoder_projection_dim ........................ 0 [default0]: biencoder_shared_query_context_model ............ False [default0]: block_data_path ................................. None [default0]: checkpoint_activations .......................... True [default0]: checkpoint_in_cpu ............................... False [default0]: checkpoint_num_layers ........................... 1 [default0]: clip_grad ....................................... 1.0 [default0]: codecarbon_dir .................................. None [default0]: consumed_train_samples .......................... 0 [default0]: consumed_train_tokens ........................... 0 [default0]: consumed_valid_samples .......................... 0 [default0]: contigious_checkpointing ........................ False [default0]: cpu_optimizer ................................... False [default0]: cpu_torch_adam .................................. False [default0]: curriculum_learning ............................. False [default0]: data_impl ....................................... mmap [default0]: data_parallel_size .............................. 8 [default0]: data_path ....................................... None [default0]: dataloader_type ................................. single [default0]: DDP_impl ........................................ local [default0]: decoder_seq_length .............................. None [default0]: deepscale ....................................... False [default0]: deepscale_config ................................ None [default0]: deepspeed ....................................... True [default0]: deepspeed_activation_checkpointing .............. True [default0]: deepspeed_config ................................ 
./ds_config.202330.json [default0]: deepspeed_mpi ................................... False [default0]: distribute_checkpointed_activations ............. False [default0]: distributed_backend ............................. nccl [default0]: embed_layernorm ................................. True [default0]: embedding_path .................................. None [default0]: encoder_seq_length .............................. 2048 [default0]: eod_mask_loss ................................... False [default0]: eval_interval ................................... 1000 [default0]: eval_iters ...................................... 10 [default0]: eval_only ....................................... None [default0]: evidence_data_path .............................. None [default0]: exit_duration_in_mins ........................... 5990 [default0]: exit_interval ................................... None [default0]: ffn_hidden_size ................................. 57344 [default0]: finetune ........................................ False [default0]: fp16 ............................................ False [default0]: fp16_lm_cross_entropy ........................... False [default0]: fp32_residual_connection ........................ False [default0]: gigaflos_no_embeds .............................. 0 [default0]: global_batch_size ............................... 2048 [default0]: glu_activation .................................. None [default0]: hidden_dropout .................................. 0.1 [default0]: hidden_size ..................................... 14336 [default0]: hysteresis ...................................... 2 [default0]: ict_head_size ................................... None [default0]: ict_load ........................................ None [default0]: img_dim ......................................... 224 [default0]: indexer_batch_size .............................. 128 [default0]: indexer_log_interval ............................ 
1000 [default0]: init_method_std ................................. 0.0048 [default0]: init_method_xavier_uniform ...................... False [default0]: initial_loss_scale .............................. 4294967296 [default0]: kill_switch_path ................................ /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/kill-switch-tr11-176B-exp1 [default0]: kv_channels ..................................... 128 [default0]: layernorm_epsilon ............................... 1e-05 [default0]: lazy_mpu_init ................................... None [default0]: load ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: local_rank ...................................... None [default0]: log_batch_size_to_tensorboard ................... True [default0]: log_interval .................................... 1 [default0]: log_learning_rate_to_tensorboard ................ True [default0]: log_level ....................................... None [default0]: log_level_replica ............................... None [default0]: log_loss_scale_to_tensorboard ................... True [default0]: log_num_zeros_in_grad ........................... False [default0]: log_params_norm ................................. False [default0]: log_path ........................................ None [default0]: log_timers_to_tensorboard ....................... True [default0]: log_validation_ppl_to_tensorboard ............... True [default0]: loss_on_targets_only ............................ False [default0]: loss_scale ...................................... None [default0]: loss_scale_window ............................... 1000 [default0]: lr .............................................. 6e-05 [default0]: lr_decay_iters .................................. None [default0]: lr_decay_samples ................................ 200000000 [default0]: lr_decay_style .................................. 
cosine [default0]: lr_decay_tokens ................................. None [default0]: lr_warmup_fraction .............................. None [default0]: lr_warmup_iters ................................. 0 [default0]: lr_warmup_samples ............................... 183105 [default0]: make_vocab_size_divisible_by .................... 128 [default0]: mask_prob ....................................... 0.15 [default0]: masked_softmax_fusion ........................... True [default0]: max_position_embeddings ......................... 2048 [default0]: memory_centric_tiled_linear ..................... False [default0]: merge_file ...................................... None [default0]: micro_batch_size ................................ 2 [default0]: min_loss_scale .................................. 1.0 [default0]: min_lr .......................................... 6e-06 [default0]: mmap_warmup ..................................... False [default0]: no_load_optim ................................... None [default0]: no_load_rng ..................................... None [default0]: no_save_optim ................................... None [default0]: no_save_rng ..................................... None [default0]: num_attention_heads ............................. 112 [default0]: num_channels .................................... 3 [default0]: num_classes ..................................... 1000 [default0]: num_layers ...................................... 70 [default0]: num_layers_per_virtual_pipeline_stage ........... None [default0]: num_workers ..................................... 2 [default0]: onnx_safe ....................................... None [default0]: openai_gelu ..................................... False [default0]: optimizer ....................................... adam [default0]: override_lr_scheduler ........................... False [default0]: pad_vocab_size_to ............................... 
250880 [default0]: params_dtype .................................... torch.bfloat16 [default0]: partition_activations ........................... False [default0]: patch_dim ....................................... 16 [default0]: pipeline_model_parallel_size .................... 12 [default0]: position_embedding_type ......................... PositionEmbeddingType.alibi [default0]: pp_partition_method ............................. type:transformer|embedding [default0]: profile_backward ................................ False [default0]: query_in_block_prob ............................. 0.1 [default0]: rampup_batch_size ............................... ['16', '16', '9_765_625'] [default0]: rank ............................................ 0 [default0]: remote_device ................................... none [default0]: reset_attention_mask ............................ False [default0]: reset_position_ids .............................. False [default0]: retriever_report_topk_accuracies ................ [] [default0]: retriever_score_scaling ......................... False [default0]: retriever_seq_length ............................ 256 [default0]: reweight_loss_based_on_position_frequency ....... False [default0]: sample_rate ..................................... 1.0 [default0]: save ............................................ /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints [default0]: save_interval ................................... 500 [default0]: scatter_gather_tensors_in_pipeline .............. True [default0]: scattered_embeddings ............................ False [default0]: seed ............................................ 42 [default0]: seq_length ...................................... 2048 [default0]: sgd_momentum .................................... 0.9 [default0]: short_seq_prob .................................. 0.1 [default0]: skip_train_iteration_range ...................... 
None [default0]: split ........................................... None [default0]: split_transformers .............................. False [default0]: synchronize_each_layer .......................... False [default0]: tensor_model_parallel_size ...................... 4 [default0]: tensorboard_dir ................................. /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/tr11-176B-ml-logs/tensorboard [default0]: tensorboard_log_interval ........................ 1 [default0]: tensorboard_queue_size .......................... 5 [default0]: test_weighted_split_names ....................... ['test'] [default0]: test_weighted_split_paths ....................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: test_weighted_split_paths_path .................. None [default0]: test_weighted_split_splits ...................... 
[['0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0', '0.999:1.0']] [default0]: test_weighted_split_weights ..................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: tile_factor ..................................... 1 [default0]: titles_data_path ................................ None [default0]: tokenizer_name_or_path .......................... bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k [default0]: tokenizer_type .................................. PretrainedFromHF [default0]: train_iters ..................................... None [default0]: train_samples ................................... 220000000 [default0]: train_tokens .................................... None [default0]: train_weighted_split_names ...................... ['train'] [default0]: train_weighted_split_paths ...................... 
[['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: train_weighted_split_paths_path ................. None [default0]: train_weighted_split_splits ..................... [['0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949', '0:0.949']] [default0]: train_weighted_split_weights .................... [['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: use_bnb_optimizer ............................... False [default0]: use_checkpoint_lr_scheduler ..................... False [default0]: use_contiguous_buffers_in_ddp ................... 
True [default0]: use_cpu_initialization .......................... None [default0]: use_one_sent_docs ............................... False [default0]: use_pin_memory .................................. False [default0]: valid_weighted_split_names ...................... ['valid'] [default0]: valid_weighted_split_paths ...................... [['/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', 
'/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document', '/gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document']] [default0]: valid_weighted_split_paths_path ................. None [default0]: valid_weighted_split_splits ..................... [['0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999', '0.949:0.999']] [default0]: valid_weighted_split_weights .................... 
[['0.0870675668625', '0.02073140422625', '0.12469955763749999', '0.12418189776749998', '0.0029046043375', '0.12469955763249999', '0.06592745982875', '0.12094050073499998', '0.0310664842075', '0.04546307670125', '0.12706392680625', '0.1246995576325', '0.0005544056375']] [default0]: virtual_pipeline_model_parallel_size ............ None [default0]: vocab_extra_ids ................................. 0 [default0]: vocab_file ...................................... None [default0]: weight_decay .................................... 0.1 [default0]: world_size ...................................... 384 [default0]: zero_allgather_bucket_size ...................... 0.0 [default0]: zero_contigious_gradients ....................... False [default0]: zero_reduce_bucket_size ......................... 0.0 [default0]: zero_reduce_scatter ............................. False [default0]: zero_stage ...................................... 0 [default0]:-------------------- end of arguments --------------------- [default0]:will use batch size rampup starting from global batch size 16 to global batch size 2048 with batch size increments 16 over 9765625 samples. [default0]:> building PretrainedFromHF tokenizer ... [default0]: vocab file is un-used. loading tokenizer from pre-trained model [default0]:Offline mode: forcing local_files_only=True [default0]:Offline mode: forcing local_files_only=True [default0]:Can't load following files from cache: ['added_tokens_file'] and cannot check if these files are necessary for the tokenizer to operate. 
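Note on the batch size rampup message above: the logged values (start 16, increment 16, 9765625 rampup samples, target 2048) imply 127 increments spread evenly over the rampup samples. The sketch below is a minimal illustration of such a linear schedule, not the Megatron-DeepSpeed code itself; the function name and the integer rounding are assumptions.

```python
# Hypothetical helper: linear global-batch-size rampup using the values from the log.
# Assumption: the batch size grows by `increment` each time an equal share of
# `rampup_samples` has been consumed, then stays at `target`.
def global_batch_size_at(consumed_samples,
                         start=16,
                         increment=16,
                         rampup_samples=9_765_625,
                         target=2048):
    num_increments = (target - start) // increment  # 127 for the logged values
    if consumed_samples >= rampup_samples:
        return target
    # Integer arithmetic avoids floating-point edge cases at step boundaries.
    steps = consumed_samples * num_increments // rampup_samples
    return start + steps * increment


if __name__ == "__main__":
    for samples in (0, 100_000, 5_000_000, 9_765_625):
        print(samples, global_batch_size_at(samples))
```

With micro_batch_size 2 and data_parallel_size 8, every value this schedule produces is a multiple of 16, so each step divides evenly into micro-batches across the data-parallel replicas.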
[default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/special_tokens_map.json from cache at /gpfswork/rech/six/commun/models/b0b3428eb9bea3ef62a6e9983742117e4860f4ec1af66eebce1702b8ec7cb364.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer_config.json from cache at /gpfswork/rech/six/commun/models/31fb66a88196017b3a12c4798e55bcf8a11b312b42dd9429c83f7237c0a8a807.e683c1a11fe6388761e34fd7cddbcd77f3552cefb70e9aca4a4cc72c027c8f40 [default0]:loading file https://huggingface.co/bigscience-catalogue-data-dev/byte-level-bpe-tokenizer-nfkc-250k/resolve/main/tokenizer.json from cache at /gpfswork/rech/six/commun/models/b28b4c1d8aed4c72b765cce6a9a7ce8c5460d05a5b4ea6fa5855dff6a721d171.397b0d7316cb89fa15f0bebce2bd6c5e71e92a14e95de167940173a60253b03e [default0]: > padded vocab (size: 250680) with 200 dummy tokens (new size: 250880) [default0]:DeepSpeed general environment info: [default0]:torch install path ............... ['/gpfswork/rech/six/commun/conda/py38-pt111/lib/python3.8/site-packages/torch'] [default0]:torch version .................... 1.11.0+cu115 [default0]:torch cuda version ............... 11.5 [default0]:nvcc version ..................... 11.4 [default0]:deepspeed install path ........... ['/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/DeepSpeed-master-bf16/deepspeed'] [default0]:deepspeed info ................... 0.6.0+ed26ef4, ed26ef4, olruwase/bf16-updates [default0]:deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5 [default0]:**** Git info for Megatron: git_hash=0415583 git_branch=sync-meg-lm **** [default0]:> initializing torch distributed ... [default7]:> setting tensorboard ... 
[default0]:> initializing tensor model parallel with size 4 [default0]:> initializing pipeline model parallel with size 12 [default0]:> setting random seeds to 42 ... [default0]:[2022-03-04 04:08:55,637] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2760 and data parallel seed: 42 [default0]:> compiling dataset index builder ... [default0]:make: Entering directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data' [default0]:make: Nothing to be done for 'default'. [default0]:make: Leaving directory '/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/data' [default0]:>>> done with dataset index builder. Compilation time: 0.108 seconds [default0]:> compiling and loading fused kernels ... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module scaled_upper_triang_masked_softmax_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module scaled_upper_triang_masked_softmax_cuda... [default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module scaled_masked_softmax_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module scaled_masked_softmax_cuda... 
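Note on the parallel layout: the sizes reported in this log (tensor 4, pipeline 12, data 8) multiply to the world size of 384, and the ProcessCoord entries in the topology dump are consistent with the tensor ("model") axis varying fastest, then data, then pipe. The sketch below reproduces that mapping; the helper names are hypothetical and this is not the DeepSpeed topology API.

```python
# Sizes taken from the argument dump in this log.
TP, PP, DP = 4, 12, 8  # tensor, pipeline, data parallel sizes


def global_rank(pipe, data, model):
    """Map (pipe, data, model) coordinates to a global rank.

    Assumption: the model axis is fastest-varying, then data, then pipe,
    which matches the ProcessCoord entries printed in the log.
    """
    return (pipe * DP + data) * TP + model


def coord(rank):
    """Inverse mapping: global rank to (pipe, data, model)."""
    pipe, rest = divmod(rank, DP * TP)
    data, model = divmod(rest, TP)
    return pipe, data, model


if __name__ == "__main__":
    print(TP * PP * DP)          # world size
    print(global_rank(1, 0, 0))  # first rank of pipeline stage 1
    print(coord(159))
```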
[default0]:Detected CUDA files, patching ldflags [default0]:Emitting ninja build file /gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... [default0]:Building extension module fused_mix_prec_layer_norm_cuda... [default0]:Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [default0]:ninja: no work to do. [default0]:Loading extension module fused_mix_prec_layer_norm_cuda... [default0]:>>> done with compiling and loading fused kernels. Compilation time: 10.004 seconds [default0]:time to initialize megatron (seconds): 81.752 [default0]:[after megatron is initialized] datetime: 2022-03-04 04:09:05 [default0]:building GPT model ... [default0]:[2022-03-04 04:09:05,789] [INFO] [utils.py:828:see_memory_usage] Before Building Model [default0]:[2022-03-04 04:09:05,789] [INFO] [utils.py:829:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB [default0]:[2022-03-04 04:09:05,790] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.22 GB, percent = 8.6% [default0]:SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None [default0]:Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3, ProcessCoord(pipe=0, data=1, model=0): 4, ProcessCoord(pipe=0, data=1, model=1): 5, ProcessCoord(pipe=0, data=1, model=2): 6, ProcessCoord(pipe=0, data=1, model=3): 7, ProcessCoord(pipe=0, data=2, model=0): 8, ProcessCoord(pipe=0, data=2, model=1): 9, ProcessCoord(pipe=0, data=2, model=2): 10, ProcessCoord(pipe=0, data=2, model=3): 11, ProcessCoord(pipe=0, data=3, model=0): 12, ProcessCoord(pipe=0, data=3, model=1): 13, ProcessCoord(pipe=0, data=3, model=2): 14, ProcessCoord(pipe=0, data=3, model=3): 15, ProcessCoord(pipe=0, data=4, model=0): 16, ProcessCoord(pipe=0, data=4, model=1): 17, ProcessCoord(pipe=0, data=4, model=2): 
18, ProcessCoord(pipe=0, data=4, model=3): 19, ProcessCoord(pipe=0, data=5, model=0): 20, ProcessCoord(pipe=0, data=5, model=1): 21, ProcessCoord(pipe=0, data=5, model=2): 22, ProcessCoord(pipe=0, data=5, model=3): 23, ProcessCoord(pipe=0, data=6, model=0): 24, ProcessCoord(pipe=0, data=6, model=1): 25, ProcessCoord(pipe=0, data=6, model=2): 26, ProcessCoord(pipe=0, data=6, model=3): 27, ProcessCoord(pipe=0, data=7, model=0): 28, ProcessCoord(pipe=0, data=7, model=1): 29, ProcessCoord(pipe=0, data=7, model=2): 30, ProcessCoord(pipe=0, data=7, model=3): 31, ProcessCoord(pipe=1, data=0, model=0): 32, ProcessCoord(pipe=1, data=0, model=1): 33, ProcessCoord(pipe=1, data=0, model=2): 34, ProcessCoord(pipe=1, data=0, model=3): 35, ProcessCoord(pipe=1, data=1, model=0): 36, ProcessCoord(pipe=1, data=1, model=1): 37, ProcessCoord(pipe=1, data=1, model=2): 38, ProcessCoord(pipe=1, data=1, model=3): 39, ProcessCoord(pipe=1, data=2, model=0): 40, ProcessCoord(pipe=1, data=2, model=1): 41, ProcessCoord(pipe=1, data=2, model=2): 42, ProcessCoord(pipe=1, data=2, model=3): 43, ProcessCoord(pipe=1, data=3, model=0): 44, ProcessCoord(pipe=1, data=3, model=1): 45, ProcessCoord(pipe=1, data=3, model=2): 46, ProcessCoord(pipe=1, data=3, model=3): 47, ProcessCoord(pipe=1, data=4, model=0): 48, ProcessCoord(pipe=1, data=4, model=1): 49, ProcessCoord(pipe=1, data=4, model=2): 50, ProcessCoord(pipe=1, data=4, model=3): 51, ProcessCoord(pipe=1, data=5, model=0): 52, ProcessCoord(pipe=1, data=5, model=1): 53, ProcessCoord(pipe=1, data=5, model=2): 54, ProcessCoord(pipe=1, data=5, model=3): 55, ProcessCoord(pipe=1, data=6, model=0): 56, ProcessCoord(pipe=1, data=6, model=1): 57, ProcessCoord(pipe=1, data=6, model=2): 58, ProcessCoord(pipe=1, data=6, model=3): 59, ProcessCoord(pipe=1, data=7, model=0): 60, ProcessCoord(pipe=1, data=7, model=1): 61, ProcessCoord(pipe=1, data=7, model=2): 62, ProcessCoord(pipe=1, data=7, model=3): 63, ProcessCoord(pipe=2, data=0, model=0): 64, 
ProcessCoord(pipe=2, data=0, model=1): 65, ProcessCoord(pipe=2, data=0, model=2): 66, ProcessCoord(pipe=2, data=0, model=3): 67, ProcessCoord(pipe=2, data=1, model=0): 68, ProcessCoord(pipe=2, data=1, model=1): 69, ProcessCoord(pipe=2, data=1, model=2): 70, ProcessCoord(pipe=2, data=1, model=3): 71, ProcessCoord(pipe=2, data=2, model=0): 72, ProcessCoord(pipe=2, data=2, model=1): 73, ProcessCoord(pipe=2, data=2, model=2): 74, ProcessCoord(pipe=2, data=2, model=3): 75, ProcessCoord(pipe=2, data=3, model=0): 76, ProcessCoord(pipe=2, data=3, model=1): 77, ProcessCoord(pipe=2, data=3, model=2): 78, ProcessCoord(pipe=2, data=3, model=3): 79, ProcessCoord(pipe=2, data=4, model=0): 80, ProcessCoord(pipe=2, data=4, model=1): 81, ProcessCoord(pipe=2, data=4, model=2): 82, ProcessCoord(pipe=2, data=4, model=3): 83, ProcessCoord(pipe=2, data=5, model=0): 84, ProcessCoord(pipe=2, data=5, model=1): 85, ProcessCoord(pipe=2, data=5, model=2): 86, ProcessCoord(pipe=2, data=5, model=3): 87, ProcessCoord(pipe=2, data=6, model=0): 88, ProcessCoord(pipe=2, data=6, model=1): 89, ProcessCoord(pipe=2, data=6, model=2): 90, ProcessCoord(pipe=2, data=6, model=3): 91, ProcessCoord(pipe=2, data=7, model=0): 92, ProcessCoord(pipe=2, data=7, model=1): 93, ProcessCoord(pipe=2, data=7, model=2): 94, ProcessCoord(pipe=2, data=7, model=3): 95, ProcessCoord(pipe=3, data=0, model=0): 96, ProcessCoord(pipe=3, data=0, model=1): 97, ProcessCoord(pipe=3, data=0, model=2): 98, ProcessCoord(pipe=3, data=0, model=3): 99, ProcessCoord(pipe=3, data=1, model=0): 100, ProcessCoord(pipe=3, data=1, model=1): 101, ProcessCoord(pipe=3, data=1, model=2): 102, ProcessCoord(pipe=3, data=1, model=3): 103, ProcessCoord(pipe=3, data=2, model=0): 104, ProcessCoord(pipe=3, data=2, model=1): 105, ProcessCoord(pipe=3, data=2, model=2): 106, ProcessCoord(pipe=3, data=2, model=3): 107, ProcessCoord(pipe=3, data=3, model=0): 108, ProcessCoord(pipe=3, data=3, model=1): 109, ProcessCoord(pipe=3, data=3, model=2): 110, 
ProcessCoord(pipe=3, data=3, model=3): 111, ProcessCoord(pipe=3, data=4, model=0): 112, ProcessCoord(pipe=3, data=4, model=1): 113, ProcessCoord(pipe=3, data=4, model=2): 114, ProcessCoord(pipe=3, data=4, model=3): 115, ProcessCoord(pipe=3, data=5, model=0): 116, ProcessCoord(pipe=3, data=5, model=1): 117, ProcessCoord(pipe=3, data=5, model=2): 118, ProcessCoord(pipe=3, data=5, model=3): 119, ProcessCoord(pipe=3, data=6, model=0): 120, ProcessCoord(pipe=3, data=6, model=1): 121, ProcessCoord(pipe=3, data=6, model=2): 122, ProcessCoord(pipe=3, data=6, model=3): 123, ProcessCoord(pipe=3, data=7, model=0): 124, ProcessCoord(pipe=3, data=7, model=1): 125, ProcessCoord(pipe=3, data=7, model=2): 126, ProcessCoord(pipe=3, data=7, model=3): 127, ProcessCoord(pipe=4, data=0, model=0): 128, ProcessCoord(pipe=4, data=0, model=1): 129, ProcessCoord(pipe=4, data=0, model=2): 130, ProcessCoord(pipe=4, data=0, model=3): 131, ProcessCoord(pipe=4, data=1, model=0): 132, ProcessCoord(pipe=4, data=1, model=1): 133, ProcessCoord(pipe=4, data=1, model=2): 134, ProcessCoord(pipe=4, data=1, model=3): 135, ProcessCoord(pipe=4, data=2, model=0): 136, ProcessCoord(pipe=4, data=2, model=1): 137, ProcessCoord(pipe=4, data=2, model=2): 138, ProcessCoord(pipe=4, data=2, model=3): 139, ProcessCoord(pipe=4, data=3, model=0): 140, ProcessCoord(pipe=4, data=3, model=1): 141, ProcessCoord(pipe=4, data=3, model=2): 142, ProcessCoord(pipe=4, data=3, model=3): 143, ProcessCoord(pipe=4, data=4, model=0): 144, ProcessCoord(pipe=4, data=4, model=1): 145, ProcessCoord(pipe=4, data=4, model=2): 146, ProcessCoord(pipe=4, data=4, model=3): 147, ProcessCoord(pipe=4, data=5, model=0): 148, ProcessCoord(pipe=4, data=5, model=1): 149, ProcessCoord(pipe=4, data=5, model=2): 150, ProcessCoord(pipe=4, data=5, model=3): 151, ProcessCoord(pipe=4, data=6, model=0): 152, ProcessCoord(pipe=4, data=6, model=1): 153, ProcessCoord(pipe=4, data=6, model=2): 154, ProcessCoord(pipe=4, data=6, model=3): 155, 
ProcessCoord(pipe=4, data=7, model=0): 156, ProcessCoord(pipe=4, data=7, model=1): 157, ProcessCoord(pipe=4, data=7, model=2): 158, ProcessCoord(pipe=4, data=7, model=3): 159, ProcessCoord(pipe=5, data=0, model=0): 160, ProcessCoord(pipe=5, data=0, model=1): 161, ProcessCoord(pipe=5, data=0, model=2): 162, ProcessCoord(pipe=5, data=0, model=3): 163, ProcessCoord(pipe=5, data=1, model=0): 164, ProcessCoord(pipe=5, data=1, model=1): 165, ProcessCoord(pipe=5, data=1, model=2): 166, ProcessCoord(pipe=5, data=1, model=3): 167, ProcessCoord(pipe=5, data=2, model=0): 168, ProcessCoord(pipe=5, data=2, model=1): 169, ProcessCoord(pipe=5, data=2, model=2): 170, ProcessCoord(pipe=5, data=2, model=3): 171, ProcessCoord(pipe=5, data=3, model=0): 172, ProcessCoord(pipe=5, data=3, model=1): 173, ProcessCoord(pipe=5, data=3, model=2): 174, ProcessCoord(pipe=5, data=3, model=3): 175, ProcessCoord(pipe=5, data=4, model=0): 176, ProcessCoord(pipe=5, data=4, model=1): 177, ProcessCoord(pipe=5, data=4, model=2): 178, ProcessCoord(pipe=5, data=4, model=3): 179, ProcessCoord(pipe=5, data=5, model=0): 180, ProcessCoord(pipe=5, data=5, model=1): 181, ProcessCoord(pipe=5, data=5, model=2): 182, ProcessCoord(pipe=5, data=5, model=3): 183, ProcessCoord(pipe=5, data=6, model=0): 184, ProcessCoord(pipe=5, data=6, model=1): 185, ProcessCoord(pipe=5, data=6, model=2): 186, ProcessCoord(pipe=5, data=6, model=3): 187, ProcessCoord(pipe=5, data=7, model=0): 188, ProcessCoord(pipe=5, data=7, model=1): 189, ProcessCoord(pipe=5, data=7, model=2): 190, ProcessCoord(pipe=5, data=7, model=3): 191, ProcessCoord(pipe=6, data=0, model=0): 192, ProcessCoord(pipe=6, data=0, model=1): 193, ProcessCoord(pipe=6, data=0, model=2): 194, ProcessCoord(pipe=6, data=0, model=3): 195, ProcessCoord(pipe=6, data=1, model=0): 196, ProcessCoord(pipe=6, data=1, model=1): 197, ProcessCoord(pipe=6, data=1, model=2): 198, ProcessCoord(pipe=6, data=1, model=3): 199, ProcessCoord(pipe=6, data=2, model=0): 200, 
ProcessCoord(pipe=6, data=2, model=1): 201, ProcessCoord(pipe=6, data=2, model=2): 202, ProcessCoord(pipe=6, data=2, model=3): 203, ProcessCoord(pipe=6, data=3, model=0): 204, ProcessCoord(pipe=6, data=3, model=1): 205, ProcessCoord(pipe=6, data=3, model=2): 206, ProcessCoord(pipe=6, data=3, model=3): 207, ProcessCoord(pipe=6, data=4, model=0): 208, ProcessCoord(pipe=6, data=4, model=1): 209, ProcessCoord(pipe=6, data=4, model=2): 210, ProcessCoord(pipe=6, data=4, model=3): 211, ProcessCoord(pipe=6, data=5, model=0): 212, ProcessCoord(pipe=6, data=5, model=1): 213, ProcessCoord(pipe=6, data=5, model=2): 214, ProcessCoord(pipe=6, data=5, model=3): 215, ProcessCoord(pipe=6, data=6, model=0): 216, ProcessCoord(pipe=6, data=6, model=1): 217, ProcessCoord(pipe=6, data=6, model=2): 218, ProcessCoord(pipe=6, data=6, model=3): 219, ProcessCoord(pipe=6, data=7, model=0): 220, ProcessCoord(pipe=6, data=7, model=1): 221, ProcessCoord(pipe=6, data=7, model=2): 222, ProcessCoord(pipe=6, data=7, model=3): 223, ProcessCoord(pipe=7, data=0, model=0): 224, ProcessCoord(pipe=7, data=0, model=1): 225, ProcessCoord(pipe=7, data=0, model=2): 226, ProcessCoord(pipe=7, data=0, model=3): 227, ProcessCoord(pipe=7, data=1, model=0): 228, ProcessCoord(pipe=7, data=1, model=1): 229, ProcessCoord(pipe=7, data=1, model=2): 230, ProcessCoord(pipe=7, data=1, model=3): 231, ProcessCoord(pipe=7, data=2, model=0): 232, ProcessCoord(pipe=7, data=2, model=1): 233, ProcessCoord(pipe=7, data=2, model=2): 234, ProcessCoord(pipe=7, data=2, model=3): 235, ProcessCoord(pipe=7, data=3, model=0): 236, ProcessCoord(pipe=7, data=3, model=1): 237, ProcessCoord(pipe=7, data=3, model=2): 238, ProcessCoord(pipe=7, data=3, model=3): 239, ProcessCoord(pipe=7, data=4, model=0): 240, ProcessCoord(pipe=7, data=4, model=1): 241, ProcessCoord(pipe=7, data=4, model=2): 242, ProcessCoord(pipe=7, data=4, model=3): 243, ProcessCoord(pipe=7, data=5, model=0): 244, ProcessCoord(pipe=7, data=5, model=1): 245, 
ProcessCoord(pipe=7, data=5, model=2): 246, ProcessCoord(pipe=7, data=5, model=3): 247, ProcessCoord(pipe=7, data=6, model=0): 248, ProcessCoord(pipe=7, data=6, model=1): 249, ProcessCoord(pipe=7, data=6, model=2): 250, ProcessCoord(pipe=7, data=6, model=3): 251, ProcessCoord(pipe=7, data=7, model=0): 252, ProcessCoord(pipe=7, data=7, model=1): 253, ProcessCoord(pipe=7, data=7, model=2): 254, ProcessCoord(pipe=7, data=7, model=3): 255, ProcessCoord(pipe=8, data=0, model=0): 256, ProcessCoord(pipe=8, data=0, model=1): 257, ProcessCoord(pipe=8, data=0, model=2): 258, ProcessCoord(pipe=8, data=0, model=3): 259, ProcessCoord(pipe=8, data=1, model=0): 260, ProcessCoord(pipe=8, data=1, model=1): 261, ProcessCoord(pipe=8, data=1, model=2): 262, ProcessCoord(pipe=8, data=1, model=3): 263, ProcessCoord(pipe=8, data=2, model=0): 264, ProcessCoord(pipe=8, data=2, model=1): 265, ProcessCoord(pipe=8, data=2, model=2): 266, ProcessCoord(pipe=8, data=2, model=3): 267, ProcessCoord(pipe=8, data=3, model=0): 268, ProcessCoord(pipe=8, data=3, model=1): 269, ProcessCoord(pipe=8, data=3, model=2): 270, ProcessCoord(pipe=8, data=3, model=3): 271, ProcessCoord(pipe=8, data=4, model=0): 272, ProcessCoord(pipe=8, data=4, model=1): 273, ProcessCoord(pipe=8, data=4, model=2): 274, ProcessCoord(pipe=8, data=4, model=3): 275, ProcessCoord(pipe=8, data=5, model=0): 276, ProcessCoord(pipe=8, data=5, model=1): 277, ProcessCoord(pipe=8, data=5, model=2): 278, ProcessCoord(pipe=8, data=5, model=3): 279, ProcessCoord(pipe=8, data=6, model=0): 280, ProcessCoord(pipe=8, data=6, model=1): 281, ProcessCoord(pipe=8, data=6, model=2): 282, ProcessCoord(pipe=8, data=6, model=3): 283, ProcessCoord(pipe=8, data=7, model=0): 284, ProcessCoord(pipe=8, data=7, model=1): 285, ProcessCoord(pipe=8, data=7, model=2): 286, ProcessCoord(pipe=8, data=7, model=3): 287, ProcessCoord(pipe=9, data=0, model=0): 288, ProcessCoord(pipe=9, data=0, model=1): 289, ProcessCoord(pipe=9, data=0, model=2): 290, 
ProcessCoord(pipe=9, data=0, model=3): 291, ProcessCoord(pipe=9, data=1, model=0): 292, ProcessCoord(pipe=9, data=1, model=1): 293, ProcessCoord(pipe=9, data=1, model=2): 294, ProcessCoord(pipe=9, data=1, model=3): 295, ProcessCoord(pipe=9, data=2, model=0): 296, ProcessCoord(pipe=9, data=2, model=1): 297, ProcessCoord(pipe=9, data=2, model=2): 298, ProcessCoord(pipe=9, data=2, model=3): 299, ProcessCoord(pipe=9, data=3, model=0): 300, ProcessCoord(pipe=9, data=3, model=1): 301, ProcessCoord(pipe=9, data=3, model=2): 302, ProcessCoord(pipe=9, data=3, model=3): 303, ProcessCoord(pipe=9, data=4, model=0): 304, ProcessCoord(pipe=9, data=4, model=1): 305, ProcessCoord(pipe=9, data=4, model=2): 306, ProcessCoord(pipe=9, data=4, model=3): 307, ProcessCoord(pipe=9, data=5, model=0): 308, ProcessCoord(pipe=9, data=5, model=1): 309, ProcessCoord(pipe=9, data=5, model=2): 310, ProcessCoord(pipe=9, data=5, model=3): 311, ProcessCoord(pipe=9, data=6, model=0): 312, ProcessCoord(pipe=9, data=6, model=1): 313, ProcessCoord(pipe=9, data=6, model=2): 314, ProcessCoord(pipe=9, data=6, model=3): 315, ProcessCoord(pipe=9, data=7, model=0): 316, ProcessCoord(pipe=9, data=7, model=1): 317, ProcessCoord(pipe=9, data=7, model=2): 318, ProcessCoord(pipe=9, data=7, model=3): 319, ProcessCoord(pipe=10, data=0, model=0): 320, ProcessCoord(pipe=10, data=0, model=1): 321, ProcessCoord(pipe=10, data=0, model=2): 322, ProcessCoord(pipe=10, data=0, model=3): 323, ProcessCoord(pipe=10, data=1, model=0): 324, ProcessCoord(pipe=10, data=1, model=1): 325, ProcessCoord(pipe=10, data=1, model=2): 326, ProcessCoord(pipe=10, data=1, model=3): 327, ProcessCoord(pipe=10, data=2, model=0): 328, ProcessCoord(pipe=10, data=2, model=1): 329, ProcessCoord(pipe=10, data=2, model=2): 330, ProcessCoord(pipe=10, data=2, model=3): 331, ProcessCoord(pipe=10, data=3, model=0): 332, ProcessCoord(pipe=10, data=3, model=1): 333, ProcessCoord(pipe=10, data=3, model=2): 334, ProcessCoord(pipe=10, data=3, model=3): 335, 
ProcessCoord(pipe=10, data=4, model=0): 336, ProcessCoord(pipe=10, data=4, model=1): 337, ProcessCoord(pipe=10, data=4, model=2): 338, ProcessCoord(pipe=10, data=4, model=3): 339, ProcessCoord(pipe=10, data=5, model=0): 340, ProcessCoord(pipe=10, data=5, model=1): 341, ProcessCoord(pipe=10, data=5, model=2): 342, ProcessCoord(pipe=10, data=5, model=3): 343, ProcessCoord(pipe=10, data=6, model=0): 344, ProcessCoord(pipe=10, data=6, model=1): 345, ProcessCoord(pipe=10, data=6, model=2): 346, ProcessCoord(pipe=10, data=6, model=3): 347, ProcessCoord(pipe=10, data=7, model=0): 348, ProcessCoord(pipe=10, data=7, model=1): 349, ProcessCoord(pipe=10, data=7, model=2): 350, ProcessCoord(pipe=10, data=7, model=3): 351, ProcessCoord(pipe=11, data=0, model=0): 352, ProcessCoord(pipe=11, data=0, model=1): 353, ProcessCoord(pipe=11, data=0, model=2): 354, ProcessCoord(pipe=11, data=0, model=3): 355, ProcessCoord(pipe=11, data=1, model=0): 356, ProcessCoord(pipe=11, data=1, model=1): 357, ProcessCoord(pipe=11, data=1, model=2): 358, ProcessCoord(pipe=11, data=1, model=3): 359, ProcessCoord(pipe=11, data=2, model=0): 360, ProcessCoord(pipe=11, data=2, model=1): 361, ProcessCoord(pipe=11, data=2, model=2): 362, ProcessCoord(pipe=11, data=2, model=3): 363, ProcessCoord(pipe=11, data=3, model=0): 364, ProcessCoord(pipe=11, data=3, model=1): 365, ProcessCoord(pipe=11, data=3, model=2): 366, ProcessCoord(pipe=11, data=3, model=3): 367, ProcessCoord(pipe=11, data=4, model=0): 368, ProcessCoord(pipe=11, data=4, model=1): 369, ProcessCoord(pipe=11, data=4, model=2): 370, ProcessCoord(pipe=11, data=4, model=3): 371, ProcessCoord(pipe=11, data=5, model=0): 372, ProcessCoord(pipe=11, data=5, model=1): 373, ProcessCoord(pipe=11, data=5, model=2): 374, ProcessCoord(pipe=11, data=5, model=3): 375, ProcessCoord(pipe=11, data=6, model=0): 376, ProcessCoord(pipe=11, data=6, model=1): 377, ProcessCoord(pipe=11, data=6, model=2): 378, ProcessCoord(pipe=11, data=6, model=3): 379, 
ProcessCoord(pipe=11, data=7, model=0): 380, ProcessCoord(pipe=11, data=7, model=1): 381, ProcessCoord(pipe=11, data=7, model=2): 382, ProcessCoord(pipe=11, data=7, model=3): 383} [default0]:[2022-03-04 04:09:07,781] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method type:transformer|embedding [default0]:stage=0 layers=8 [default0]: 0: _to_float16 [default0]: 1: EmbeddingPipe [default0]: 2: <lambda> [default0]: 3: ParallelTransformerLayerPipe [default0]: 4: ParallelTransformerLayerPipe [default0]: 5: ParallelTransformerLayerPipe [default0]: 6: ParallelTransformerLayerPipe [default0]: 7: ParallelTransformerLayerPipe [default0]:stage=1 layers=6 [default0]: 8: ParallelTransformerLayerPipe [default0]: 9: ParallelTransformerLayerPipe [default0]: 10: ParallelTransformerLayerPipe [default0]: 11: ParallelTransformerLayerPipe [default0]: 12: ParallelTransformerLayerPipe [default0]: 13: ParallelTransformerLayerPipe [default0]:stage=2 layers=6 [default0]: 14: ParallelTransformerLayerPipe [default0]: 15: ParallelTransformerLayerPipe [default0]: 16: ParallelTransformerLayerPipe [default0]: 17: ParallelTransformerLayerPipe [default0]: 18: ParallelTransformerLayerPipe [default0]: 19: ParallelTransformerLayerPipe [default0]:stage=3 layers=6 [default0]: 20: ParallelTransformerLayerPipe [default0]: 21: ParallelTransformerLayerPipe [default0]: 22: ParallelTransformerLayerPipe [default0]: 23: ParallelTransformerLayerPipe [default0]: 24: ParallelTransformerLayerPipe [default0]: 25: ParallelTransformerLayerPipe [default0]:stage=4 layers=6 [default0]: 26: ParallelTransformerLayerPipe [default0]: 27: ParallelTransformerLayerPipe [default0]: 28: ParallelTransformerLayerPipe [default0]: 29: ParallelTransformerLayerPipe [default0]: 30: ParallelTransformerLayerPipe [default0]: 31: ParallelTransformerLayerPipe [default0]:stage=5 layers=6 [default0]: 32: ParallelTransformerLayerPipe [default0]: 33: ParallelTransformerLayerPipe [default0]: 34: 
ParallelTransformerLayerPipe [default0]: 35: ParallelTransformerLayerPipe [default0]: 36: ParallelTransformerLayerPipe [default0]: 37: ParallelTransformerLayerPipe [default0]:stage=6 layers=6 [default0]: 38: ParallelTransformerLayerPipe [default0]: 39: ParallelTransformerLayerPipe [default0]: 40: ParallelTransformerLayerPipe [default0]: 41: ParallelTransformerLayerPipe [default0]: 42: ParallelTransformerLayerPipe [default0]: 43: ParallelTransformerLayerPipe [default0]:stage=7 layers=6 [default0]: 44: ParallelTransformerLayerPipe [default0]: 45: ParallelTransformerLayerPipe [default0]: 46: ParallelTransformerLayerPipe [default0]: 47: ParallelTransformerLayerPipe [default0]: 48: ParallelTransformerLayerPipe [default0]: 49: ParallelTransformerLayerPipe [default0]:stage=8 layers=6 [default0]: 50: ParallelTransformerLayerPipe [default0]: 51: ParallelTransformerLayerPipe [default0]: 52: ParallelTransformerLayerPipe [default0]: 53: ParallelTransformerLayerPipe [default0]: 54: ParallelTransformerLayerPipe [default0]: 55: ParallelTransformerLayerPipe [default0]:stage=9 layers=6 [default0]: 56: ParallelTransformerLayerPipe [default0]: 57: ParallelTransformerLayerPipe [default0]: 58: ParallelTransformerLayerPipe [default0]: 59: ParallelTransformerLayerPipe [default0]: 60: ParallelTransformerLayerPipe [default0]: 61: ParallelTransformerLayerPipe [default0]:stage=10 layers=6 [default0]: 62: ParallelTransformerLayerPipe [default0]: 63: ParallelTransformerLayerPipe [default0]: 64: ParallelTransformerLayerPipe [default0]: 65: ParallelTransformerLayerPipe [default0]: 66: ParallelTransformerLayerPipe [default0]: 67: ParallelTransformerLayerPipe [default0]:stage=11 layers=9 [default0]: 68: ParallelTransformerLayerPipe [default0]: 69: ParallelTransformerLayerPipe [default0]: 70: ParallelTransformerLayerPipe [default0]: 71: ParallelTransformerLayerPipe [default0]: 72: ParallelTransformerLayerPipe [default0]: 73: <lambda> [default0]: 74: MixedFusedLayerNorm [default0]: 75: EmbeddingPipe 
[default0]: 76: float16_to_fp32 [default0]: loss: CrossEntropy [default0]:[2022-03-04 04:09:08,974] [INFO] [utils.py:828:see_memory_usage] After Building Model [default0]:[2022-03-04 04:09:08,974] [INFO] [utils.py:829:see_memory_usage] MA 7.43 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-04 04:09:08,975] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.64 GB, percent = 8.7% [default0]:setting training iterations to 128728 [default0]:> learning rate decay style: cosine [default0]:DeepSpeed is enabled. [default0]:[2022-03-04 04:09:08,996] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+ed26ef4, git-hash=ed26ef4, git-branch=olruwase/bf16-updates [default0]:[2022-03-04 04:09:10,849] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False [default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1092:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer [default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1098:_configure_optimizer] Using client Optimizer as basic optimizer [default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1114:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam [default0]:[2022-03-04 04:09:10,850] [INFO] [engine.py:1328:_configure_bf16_optimizer] Creating unfused BF16 optimizer [default0]:[2022-03-04 04:09:10,885] [INFO] [utils.py:828:see_memory_usage] begin bf16_optimizer [default0]:[2022-03-04 04:09:10,886] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.43 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-04 04:09:10,886] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:10,908] [INFO] [utils.py:828:see_memory_usage] before initializing group 0 [default0]:[2022-03-04 04:09:10,909] [INFO] [utils.py:829:see_memory_usage] MA 7.42 GB Max_MA 7.42 GB CA 7.45 GB Max_CA 7 GB [default0]:[2022-03-04 04:09:10,909] [INFO] [utils.py:837:see_memory_usage] CPU Virtual 
Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:10,971] [INFO] [utils.py:828:see_memory_usage] after initializing group 0 [default0]:[2022-03-04 04:09:10,971] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-04 04:09:10,971] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:10,992] [INFO] [utils.py:828:see_memory_usage] before initializing group 1 [default0]:[2022-03-04 04:09:10,993] [INFO] [utils.py:829:see_memory_usage] MA 17.01 GB Max_MA 17.01 GB CA 20.23 GB Max_CA 20 GB [default0]:[2022-03-04 04:09:10,993] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,036] [INFO] [utils.py:828:see_memory_usage] after initializing group 1 [default0]:[2022-03-04 04:09:11,036] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:09:11,036] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,057] [INFO] [utils.py:828:see_memory_usage] before initializing group 2 [default0]:[2022-03-04 04:09:11,058] [INFO] [utils.py:829:see_memory_usage] MA 24.11 GB Max_MA 24.11 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:09:11,058] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,080] [INFO] [utils.py:828:see_memory_usage] after initializing group 2 [default0]:[2022-03-04 04:09:11,080] [INFO] [utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:09:11,080] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,102] [INFO] [utils.py:828:see_memory_usage] before initialize_optimizer [default0]:[2022-03-04 04:09:11,102] [INFO] 
[utils.py:829:see_memory_usage] MA 24.12 GB Max_MA 24.12 GB CA 30.5 GB Max_CA 30 GB [default0]:[2022-03-04 04:09:11,102] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,152] [INFO] [utils.py:828:see_memory_usage] end initialize_optimizer [default0]:[2022-03-04 04:09:11,153] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-04 04:09:11,153] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,175] [INFO] [utils.py:828:see_memory_usage] end bf16_optimizer [default0]:[2022-03-04 04:09:11,175] [INFO] [utils.py:829:see_memory_usage] MA 27.82 GB Max_MA 27.82 GB CA 34.21 GB Max_CA 34 GB [default0]:[2022-03-04 04:09:11,175] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 43.99 GB, percent = 8.7% [default0]:[2022-03-04 04:09:11,175] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [default0]:[2022-03-04 04:09:11,175] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler [default0]:[2022-03-04 04:09:11,175] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x14850b086250> [default0]:[2022-03-04 04:09:11,176] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)] [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1057:print] DeepSpeedEngine configuration: [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] activation_checkpointing_config { [default0]: "partition_activations": false, [default0]: "contiguous_memory_optimization": false, [default0]: "cpu_checkpointing": false, [default0]: "number_checkpoints": null, [default0]: "synchronize_checkpoint_boundary": false, [default0]: "profile": false [default0]:} [default0]:[2022-03-04 04:09:11,176] 
[INFO] [config.py:1061:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] amp_enabled .................. False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] amp_params ................... False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] autotuning_config ............ { [default0]: "enabled": false, [default0]: "start_step": null, [default0]: "end_step": null, [default0]: "metric_path": null, [default0]: "arg_mappings": null, [default0]: "metric": "throughput", [default0]: "model_info": null, [default0]: "results_dir": null, [default0]: "exps_dir": null, [default0]: "overwrite": true, [default0]: "fast": true, [default0]: "start_profile_step": 3, [default0]: "end_profile_step": 5, [default0]: "tuner_type": "gridsearch", [default0]: "tuner_early_stopping": 5, [default0]: "tuner_num_trials": 50, [default0]: "model_info_path": null, [default0]: "mp_size": 1, [default0]: "max_train_batch_size": null, [default0]: "min_train_batch_size": 1, [default0]: "max_train_micro_batch_size_per_gpu": 1.024000e+03, [default0]: "min_train_micro_batch_size_per_gpu": 1, [default0]: "num_tuning_micro_batch_sizes": 3 [default0]:} [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] bfloat16_enabled ............. True [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] checkpoint_tag_validation_enabled True [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] checkpoint_tag_validation_fail False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] communication_data_type ...... None [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] curriculum_enabled ........... False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] curriculum_params ............ 
False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] dataloader_drop_last ......... False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] disable_allgather ............ False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] dump_state ................... False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] dynamic_loss_scale_args ...... None [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_enabled ........... False [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_gas_boundary_resolution 1 [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_layer_name ........ bert.encoder.layer [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_layer_num ......... 0 [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_max_iter .......... 100 [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_stability ......... 1e-06 [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_tol ............... 0.01 [default0]:[2022-03-04 04:09:11,176] [INFO] [config.py:1061:print] eigenvalue_verbose ........... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] elasticity_enabled ........... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] flops_profiler_config ........ { [default0]: "enabled": false, [default0]: "profile_step": 1, [default0]: "module_depth": -1, [default0]: "top_modules": 1, [default0]: "detailed": true, [default0]: "output_file": null [default0]:} [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] fp16_enabled ................. False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] fp16_master_weights_and_gradients False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] fp16_mixed_quantize .......... 
False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] global_rank .................. 0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] gradient_accumulation_steps .. 128 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] gradient_clipping ............ 1.0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] gradient_predivide_factor .... 1.0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] initial_dynamic_scale ........ 1 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] loss_scale ................... 1.0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] memory_breakdown ............. False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] optimizer_legacy_fusion ...... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] optimizer_name ............... None [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] optimizer_params ............. None [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] pld_enabled .................. False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] pld_params ................... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] prescale_gradients ........... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_change_rate ......... 0.001 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_groups .............. 1 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_offset .............. 1000 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_period .............. 
1000 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_rounding ............ 0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_start_bits .......... 16 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_target_bits ......... 8 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_training_enabled .... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_type ................ 0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] quantize_verbose ............. False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] scheduler_name ............... None [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] scheduler_params ............. None [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] sparse_attention ............. None [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] sparse_gradients_enabled ..... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] steps_per_print .............. 2000 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] tensorboard_enabled .......... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] tensorboard_job_name ......... DeepSpeedJobName [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] tensorboard_output_path ...... [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] train_batch_size ............. 2048 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] train_micro_batch_size_per_gpu 2 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] use_quantizer_kernel ......... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] wall_clock_breakdown ......... False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] world_size ................... 
8 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] zero_allow_untested_optimizer False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] zero_config .................. { [default0]: "stage": 0, [default0]: "contiguous_gradients": true, [default0]: "reduce_scatter": true, [default0]: "reduce_bucket_size": 5.000000e+08, [default0]: "allgather_partitions": true, [default0]: "allgather_bucket_size": 5.000000e+08, [default0]: "overlap_comm": false, [default0]: "load_from_fp32_weights": true, [default0]: "elastic_checkpoint": false, [default0]: "offload_param": null, [default0]: "offload_optimizer": null, [default0]: "sub_group_size": 1.000000e+09, [default0]: "prefetch_bucket_size": 5.000000e+07, [default0]: "param_persistence_threshold": 1.000000e+05, [default0]: "max_live_parameters": 1.000000e+09, [default0]: "max_reuse_distance": 1.000000e+09, [default0]: "gather_16bit_weights_on_model_save": false, [default0]: "ignore_unused_parameters": true, [default0]: "round_robin_gradients": false, [default0]: "legacy_stage1": false [default0]:} [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] zero_enabled ................. False [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1061:print] zero_optimization_stage ...... 
0 [default0]:[2022-03-04 04:09:11,177] [INFO] [config.py:1063:print] json = { [default0]: "train_micro_batch_size_per_gpu": 2, [default0]: "train_batch_size": 2.048000e+03, [default0]: "gradient_clipping": 1.0, [default0]: "zero_optimization": { [default0]: "stage": 0 [default0]: }, [default0]: "bf16": { [default0]: "enabled": true [default0]: }, [default0]: "steps_per_print": 2.000000e+03, [default0]: "wall_clock_breakdown": false [default0]:} [default0]:[2022-03-04 04:09:11,178] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=128 micro_batch_size=2 [default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=323 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=321 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=322 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=320 STAGE=10 LAYERS=6 [62, 68) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=128 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=194 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=192 STAGE=6 LAYERS=6 [38, 44) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=193 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=195 STAGE=6 LAYERS=6 [38, 44) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=97 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=99 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=96 STAGE=3 LAYERS=6 [20, 26) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=131 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=129 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=130 STAGE=4 LAYERS=6 [26, 32) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=98 STAGE=3 LAYERS=6 [20, 26) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=224 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=227 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=225 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=353 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=354 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=355 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=352 STAGE=11 LAYERS=9 [68, 77) STAGE_PARAMS=3982580224 (3982.580M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=226 STAGE=7 LAYERS=6 [44, 50) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=66 STAGE=2 LAYERS=6 [14, 20) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=64 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=65 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=67 STAGE=2 LAYERS=6 [14, 20) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=256 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=257 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=259 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=33 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=34 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=32 STAGE=1 LAYERS=6 [8, 14) 
STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=35 STAGE=1 LAYERS=6 [8, 14) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=289 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=2 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=288 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default1]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=3 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=291 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=258 STAGE=8 LAYERS=6 [50, 56) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M) [default2]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=290 STAGE=9 LAYERS=6 [56, 62) STAGE_PARAMS=3700042752 
(3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,467] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=8 [0, 8) STAGE_PARAMS=3982551552 (3982.552M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=160 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default1]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=161 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default3]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=163 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default2]:[2022-03-04 04:09:12,468] [INFO] [engine.py:151:__init__] RANK=162 STAGE=5 LAYERS=6 [32, 38) STAGE_PARAMS=3700042752 (3700.043M) TOTAL_PARAMS=179862237184 (179862.237M) UNIQUE_PARAMS=176265506816 (176265.507M)
[default0]: > using checkpoint value 6e-05 for learning rate
[default0]: > using checkpoint value 6e-06 for minimum learning rate
[default0]: > using checkpoint value 183105 for warmup iterations
[default0]: > using checkpoint value 200000000 for total number of iterations
[default0]: > using checkpoint value cosine for decay style
[default0]:[2022-03-04 04:09:25,632] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 120
[default0]:[2022-03-04 04:09:26,492] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 120
[default4]:[2022-03-04 04:09:27,414] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 124
[default0]:[2022-03-04 04:09:27,431] [INFO]
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 40 [default2]:[2022-03-04 04:09:28,021] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 122 [default4]:[2022-03-04 04:09:28,258] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 44 [default3]:[2022-03-04 04:09:28,390] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 331 [default0]:[2022-03-04 04:09:28,365] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 40 [default4]:[2022-03-04 04:09:28,445] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 124 [default4]:[2022-03-04 04:09:28,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 180 [default0]:[2022-03-04 04:09:28,946] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 272 [default2]:[2022-03-04 04:09:29,017] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 122 [default4]:[2022-03-04 04:09:29,143] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 44 [default0]:[2022-03-04 04:09:29,240] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 48 [default3]:[2022-03-04 04:09:29,340] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 331 [default4]:[2022-03-04 04:09:29,257] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 236 [default4]:[2022-03-04 04:09:29,437] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 332 [default4]:[2022-03-04 04:09:29,554] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 372 
[default7]:[2022-03-04 04:09:29,472] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 167 [default4]:[2022-03-04 04:09:29,660] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 252 [default4]:[2022-03-04 04:09:29,671] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 180 [default0]:[2022-03-04 04:09:29,679] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 104 [default6]:[2022-03-04 04:09:29,795] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 278 [default4]:[2022-03-04 04:09:29,833] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 28 [default5]:[2022-03-04 04:09:29,893] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 277 [default7]:[2022-03-04 04:09:29,913] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 335 [default0]:[2022-03-04 04:09:29,898] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 24 [default2]:[2022-03-04 04:09:30,018] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 274 [default0]:[2022-03-04 04:09:30,019] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 272 [default6]:[2022-03-04 04:09:29,989] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 238 [default6]:[2022-03-04 04:09:29,987] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 126 [default4]:[2022-03-04 04:09:30,057] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 276 [default1]:[2022-03-04 04:09:30,056] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 121 [default0]:[2022-03-04 04:09:30,065] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 48 [default1]:[2022-03-04 04:09:30,085] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 233 [default5]:[2022-03-04 04:09:30,119] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 125 [default4]:[2022-03-04 04:09:30,145] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 236 [default3]:[2022-03-04 04:09:30,189] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 123 [default4]:[2022-03-04 04:09:30,149] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 316 [default0]:[2022-03-04 04:09:30,269] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 248 [default0]:[2022-03-04 04:09:30,248] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 352 [default4]:[2022-03-04 04:09:30,340] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 308 [default4]:[2022-03-04 04:09:30,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 332 [default5]:[2022-03-04 04:09:30,344] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 237 [default7]:[2022-03-04 04:09:30,322] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 167 [default4]:[2022-03-04 04:09:30,419] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 108 [default1]:[2022-03-04 04:09:30,367] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 329 [default4]:[2022-03-04 04:09:30,393] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 20 [default7]:[2022-03-04 04:09:30,472] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 279 [default6]:[2022-03-04 04:09:30,499] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 334 [default4]:[2022-03-04 04:09:30,509] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 252 [default0]:[2022-03-04 04:09:30,448] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 232 [default2]:[2022-03-04 04:09:30,533] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 42 [default4]:[2022-03-04 04:09:30,521] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 372 [default7]:[2022-03-04 04:09:30,632] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 127 [default0]:[2022-03-04 04:09:30,550] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 104 [default2]:[2022-03-04 04:09:30,558] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 314 [default0]:[2022-03-04 04:09:30,675] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 328 [default4]:[2022-03-04 04:09:30,738] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 284 [default0]:[2022-03-04 04:09:30,715] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 280 [default5]:[2022-03-04 04:09:30,695] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 157 [default4]:[2022-03-04 04:09:30,724] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 164 
[default0]:[2022-03-04 04:09:30,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 160 [default0]:[2022-03-04 04:09:30,810] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 24 [default3]:[2022-03-04 04:09:30,811] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 163 [default6]:[2022-03-04 04:09:30,783] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 166 [default4]:[2022-03-04 04:09:30,909] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 52 [default5]:[2022-03-04 04:09:30,875] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 277 [default7]:[2022-03-04 04:09:30,859] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 335 [default6]:[2022-03-04 04:09:30,939] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 278 [default6]:[2022-03-04 04:09:30,900] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 238 [default0]:[2022-03-04 04:09:30,868] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 344 [default4]:[2022-03-04 04:09:30,951] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 28 [default2]:[2022-03-04 04:09:30,935] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 162 [default1]:[2022-03-04 04:09:31,043] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 177 [default5]:[2022-03-04 04:09:30,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 317 [default4]:[2022-03-04 04:09:31,048] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 316 [default0]:[2022-03-04 04:09:31,135] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 304 [default1]:[2022-03-04 04:09:31,103] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 233 [default1]:[2022-03-04 04:09:31,067] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 273 [default4]:[2022-03-04 04:09:31,086] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 348 [default0]:[2022-03-04 04:09:31,094] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 336 [default6]:[2022-03-04 04:09:31,087] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 30 [default2]:[2022-03-04 04:09:31,152] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 274 [default4]:[2022-03-04 04:09:31,225] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 276 [default4]:[2022-03-04 04:09:31,183] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 308 [default2]:[2022-03-04 04:09:31,242] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 154 [default6]:[2022-03-04 04:09:31,198] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 46 [default0]:[2022-03-04 04:09:31,175] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 368 [default0]:[2022-03-04 04:09:31,334] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 248 [default0]:[2022-03-04 04:09:31,249] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 352 [default2]:[2022-03-04 04:09:31,330] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 234 [default0]:[2022-03-04 04:09:31,315] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 56 [default4]:[2022-03-04 04:09:31,341] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 20 [default5]:[2022-03-04 04:09:31,270] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 237 [default3]:[2022-03-04 04:09:31,406] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 275 [default6]:[2022-03-04 04:09:31,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 334 [default4]:[2022-03-04 04:09:31,372] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 108 [default0]:[2022-03-04 04:09:31,358] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 232 [default2]:[2022-03-04 04:09:31,414] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 314 [default1]:[2022-03-04 04:09:31,408] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 329 [default2]:[2022-03-04 04:09:31,404] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 42 [default3]:[2022-03-04 04:09:31,486] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 123 [default4]:[2022-03-04 04:09:31,516] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 196 [default0]:[2022-03-04 04:09:31,458] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 192 [default5]:[2022-03-04 04:09:31,535] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 125 [default5]:[2022-03-04 
04:09:31,492] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 333 [default4]:[2022-03-04 04:09:31,494] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 116 [default2]:[2022-03-04 04:09:31,534] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 330 [default5]:[2022-03-04 04:09:31,551] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 157 [default0]:[2022-03-04 04:09:31,537] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 152 [default4]:[2022-03-04 04:09:31,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 300 [default1]:[2022-03-04 04:09:31,586] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 121 [default0]:[2022-03-04 04:09:31,622] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 328 [default6]:[2022-03-04 04:09:31,574] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 126 [default0]:[2022-03-04 04:09:31,593] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 168 [default4]:[2022-03-04 04:09:31,611] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 156 [default0]:[2022-03-04 04:09:31,573] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 280 [default1]:[2022-03-04 04:09:31,607] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 161 [default7]:[2022-03-04 04:09:31,644] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 279 [default5]:[2022-03-04 04:09:31,671] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for 
rank 53 [default6]:[2022-03-04 04:09:31,669] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 62 [default7]:[2022-03-04 04:09:31,665] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 63 [default4]:[2022-03-04 04:09:31,663] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 284 [default3]:[2022-03-04 04:09:31,709] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 171 [default4]:[2022-03-04 04:09:31,810] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 52 [default7]:[2022-03-04 04:09:31,748] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 127 [default5]:[2022-03-04 04:09:31,802] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 109 [default1]:[2022-03-04 04:09:31,762] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 193 [default3]:[2022-03-04 04:09:31,836] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 235 [default2]:[2022-03-04 04:09:31,787] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 74 [default3]:[2022-03-04 04:09:31,798] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 155 [default6]:[2022-03-04 04:09:31,784] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 374 [default1]:[2022-03-04 04:09:31,862] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 313 [default6]:[2022-03-04 04:09:31,863] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 182 [default5]:[2022-03-04 04:09:31,906] [INFO] [engine.py:2668:_load_zero_checkpoint] 
loading 8 zero partition checkpoints for rank 317 [default3]:[2022-03-04 04:09:31,933] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 315 [default5]:[2022-03-04 04:09:31,872] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 189 [default5]:[2022-03-04 04:09:31,935] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 45 [default4]:[2022-03-04 04:09:31,917] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 148 [default0]:[2022-03-04 04:09:31,911] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 336 [default1]:[2022-03-04 04:09:31,973] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 105 [default1]:[2022-03-04 04:09:31,986] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 177 [default0]:[2022-03-04 04:09:31,958] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 176 [default4]:[2022-03-04 04:09:32,037] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 228 [default1]:[2022-03-04 04:09:32,044] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 273 [default0]:[2022-03-04 04:09:31,999] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 344 [default0]:[2022-03-04 04:09:31,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 312 [default0]:[2022-03-04 04:09:31,991] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 160 [default2]:[2022-03-04 04:09:31,971] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 162 [default0]:[2022-03-04 04:09:32,081] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 304 [default2]:[2022-03-04 04:09:32,088] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 178 [default7]:[2022-03-04 04:09:32,133] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 183 [default5]:[2022-03-04 04:09:32,128] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 181 [default7]:[2022-03-04 04:09:32,058] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 175 [default3]:[2022-03-04 04:09:32,152] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 115 [default0]:[2022-03-04 04:09:32,088] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 72 [default6]:[2022-03-04 04:09:32,150] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 46 [default1]:[2022-03-04 04:09:32,110] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 153 [default6]:[2022-03-04 04:09:32,127] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 30 [default2]:[2022-03-04 04:09:32,193] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 234 [default0]:[2022-03-04 04:09:32,197] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 184 [default4]:[2022-03-04 04:09:32,246] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 348 [default2]:[2022-03-04 04:09:32,186] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 154 [default7]:[2022-03-04 04:09:32,232] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 159 
[default0]:[2022-03-04 04:09:32,204] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 368 [default2]:[2022-03-04 04:09:32,190] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 26 [default4]:[2022-03-04 04:09:32,237] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 164 [default3]:[2022-03-04 04:09:32,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 163 [default6]:[2022-03-04 04:09:32,236] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 166 [default3]:[2022-03-04 04:09:32,312] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 275 [default5]:[2022-03-04 04:09:32,269] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 253 [default4]:[2022-03-04 04:09:32,268] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 76 [default7]:[2022-03-04 04:09:32,266] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 191 [default7]:[2022-03-04 04:09:32,321] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 375 [default0]:[2022-03-04 04:09:32,427] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 192 [default4]:[2022-03-04 04:09:32,406] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 68 [default3]:[2022-03-04 04:09:32,373] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 59 [default1]:[2022-03-04 04:09:32,359] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 185 [default5]:[2022-03-04 04:09:32,374] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 333 [default0]:[2022-03-04 04:09:32,435] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 56 [default2]:[2022-03-04 04:09:32,430] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 266 [default0]:[2022-03-04 04:09:32,434] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 168 [default4]:[2022-03-04 04:09:32,360] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 172 [default7]:[2022-03-04 04:09:32,365] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 47 [default3]:[2022-03-04 04:09:32,407] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 147 [default4]:[2022-03-04 04:09:32,384] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 340 [default3]:[2022-03-04 04:09:32,432] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 27 [default1]:[2022-03-04 04:09:32,530] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 305 [default4]:[2022-03-04 04:09:32,517] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 196 [default0]:[2022-03-04 04:09:32,499] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 128 [default2]:[2022-03-04 04:09:32,532] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 114 [default4]:[2022-03-04 04:09:32,490] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 116 [default2]:[2022-03-04 04:09:32,486] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 58 [default2]:[2022-03-04 04:09:32,501] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 330 [default0]:[2022-03-04 04:09:32,472] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 264 [default3]:[2022-03-04 04:09:32,548] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 43 [default6]:[2022-03-04 04:09:32,490] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 262 [default1]:[2022-03-04 04:09:32,552] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 161 [default7]:[2022-03-04 04:09:32,542] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 31 [default5]:[2022-03-04 04:09:32,585] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 53 [default3]:[2022-03-04 04:09:32,637] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 187 [default1]:[2022-03-04 04:09:32,625] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 57 [default4]:[2022-03-04 04:09:32,636] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 268 [default3]:[2022-03-04 04:09:32,626] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 307 [default5]:[2022-03-04 04:09:32,627] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 309 [default6]:[2022-03-04 04:09:32,634] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 310 [default7]:[2022-03-04 04:09:32,553] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 255 [default7]:[2022-03-04 04:09:32,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 239 
[default6]:[2022-03-04 04:09:32,597] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 174 [default4]:[2022-03-04 04:09:32,625] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 292 [default6]:[2022-03-04 04:09:32,577] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 118 [default5]:[2022-03-04 04:09:32,638] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 77 [default7]:[2022-03-04 04:09:32,652] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 319 [default3]:[2022-03-04 04:09:32,611] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 171 [default6]:[2022-03-04 04:09:32,674] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 54 [default7]:[2022-03-04 04:09:32,654] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 311 [default5]:[2022-03-04 04:09:32,698] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 109 [default5]:[2022-03-04 04:09:32,734] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 197 [default1]:[2022-03-04 04:09:32,735] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 193 [default3]:[2022-03-04 04:09:32,668] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 235 [default3]:[2022-03-04 04:09:32,681] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 251 [default4]:[2022-03-04 04:09:32,658] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 60 [default2]:[2022-03-04 04:09:32,680] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully 
read 8 ZeRO state_dicts for rank 306 [default6]:[2022-03-04 04:09:32,747] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 318 [default0]:[2022-03-04 04:09:32,686] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 0 [default0]:[2022-03-04 04:09:32,661] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 32 [default4]:[2022-03-04 04:09:32,665] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 156 [default0]:[2022-03-04 04:09:32,660] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 208 [default0]:[2022-03-04 04:09:32,676] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 152 [default5]:[2022-03-04 04:09:32,729] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 165 [default4]:[2022-03-04 04:09:32,787] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 300 [default1]:[2022-03-04 04:09:32,771] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 297 [default4]:[2022-03-04 04:09:32,841] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 140 [default6]:[2022-03-04 04:09:32,781] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 62 [default2]:[2022-03-04 04:09:32,756] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 50 [default1]:[2022-03-04 04:09:32,831] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 49 [default2]:[2022-03-04 04:09:32,830] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 74 [default5]:[2022-03-04 04:09:32,799] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 45 [default4]:[2022-03-04 04:09:32,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 148 [default6]:[2022-03-04 04:09:32,834] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 78 [default4]:[2022-03-04 04:09:32,840] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 380 [default1]:[2022-03-04 04:09:32,834] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 337 [default3]:[2022-03-04 04:09:32,854] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 339 [default0]:[2022-03-04 04:09:32,856] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 216 [default0]:[2022-03-04 04:09:32,932] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 296 [default7]:[2022-03-04 04:09:32,867] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 199 [default2]:[2022-03-04 04:09:32,876] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 106 [default5]:[2022-03-04 04:09:32,918] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 61 [default4]:[2022-03-04 04:09:32,917] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 228 [default1]:[2022-03-04 04:09:32,951] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 169 [default5]:[2022-03-04 04:09:32,864] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 189 [default4]:[2022-03-04 04:09:32,876] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 188 
[default3]:[2022-03-04 04:09:32,861] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 155 [default1]:[2022-03-04 04:09:32,900] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 73 [default5]:[2022-03-04 04:09:32,986] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 301 [default1]:[2022-03-04 04:09:32,970] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 105 [default3]:[2022-03-04 04:09:32,984] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 107 [default1]:[2022-03-04 04:09:33,003] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 249 [default7]:[2022-03-04 04:09:33,040] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 63 [default1]:[2022-03-04 04:09:32,963] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 313 [default1]:[2022-03-04 04:09:32,978] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 113 [default3]:[2022-03-04 04:09:32,969] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 315 [default4]:[2022-03-04 04:09:33,036] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 36 [default4]:[2022-03-04 04:09:32,999] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 212 [default1]:[2022-03-04 04:09:33,025] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 153 [default0]:[2022-03-04 04:09:32,975] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 376 [default6]:[2022-03-04 04:09:32,974] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 374 [default3]:[2022-03-04 04:09:33,100] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 179 [default0]:[2022-03-04 04:09:33,070] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 176 [default2]:[2022-03-04 04:09:33,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 282 [default0]:[2022-03-04 04:09:33,147] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 112 [default7]:[2022-03-04 04:09:33,103] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 111 [default2]:[2022-03-04 04:09:33,146] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 250 [default0]:[2022-03-04 04:09:33,104] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 72 [default0]:[2022-03-04 04:09:33,084] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 312 [default7]:[2022-03-04 04:09:33,096] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 159 [default7]:[2022-03-04 04:09:33,151] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 55 [default6]:[2022-03-04 04:09:33,167] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 110 [default0]:[2022-03-04 04:09:33,165] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 64 [default6]:[2022-03-04 04:09:33,183] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 270 [default2]:[2022-03-04 04:09:33,211] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 170 [default7]:[2022-03-04 04:09:33,173] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 175 [default3]:[2022-03-04 04:09:33,213] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 291 [default6]:[2022-03-04 04:09:33,242] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 158 [default7]:[2022-03-04 04:09:33,187] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 79 [default0]:[2022-03-04 04:09:33,234] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 144 [default1]:[2022-03-04 04:09:33,193] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 265 [default3]:[2022-03-04 04:09:33,199] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 371 [default3]:[2022-03-04 04:09:33,314] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 91 [default3]:[2022-03-04 04:09:33,339] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 195 [default0]:[2022-03-04 04:09:33,324] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 128 [default5]:[2022-03-04 04:09:33,285] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 253 [default3]:[2022-03-04 04:09:33,266] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 115 [default4]:[2022-03-04 04:09:33,283] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 76 [default7]:[2022-03-04 04:09:33,316] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 47 [default5]:[2022-03-04 04:09:33,337] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 293 
[default7]:[2022-03-04 04:09:33,331] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 151 [default4]:[2022-03-04 04:09:33,264] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 340 [default2]:[2022-03-04 04:09:33,358] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 338 [default2]:[2022-03-04 04:09:33,342] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 26 [default2]:[2022-03-04 04:09:33,306] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 370 [default4]:[2022-03-04 04:09:33,349] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 324 [default1]:[2022-03-04 04:09:33,413] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 65 [default3]:[2022-03-04 04:09:33,394] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 59 [default6]:[2022-03-04 04:09:33,356] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 182 [default0]:[2022-03-04 04:09:33,373] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 184 [default5]:[2022-03-04 04:09:33,436] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 349 [default7]:[2022-03-04 04:09:33,438] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 351 [default7]:[2022-03-04 04:09:33,442] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 239 [default5]:[2022-03-04 04:09:33,401] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 173 [default1]:[2022-03-04 04:09:33,405] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 
ZeRO state_dicts for rank 345 [default2]:[2022-03-04 04:09:33,434] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 346 [default6]:[2022-03-04 04:09:33,384] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 350 [default3]:[2022-03-04 04:09:33,365] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 75 [default3]:[2022-03-04 04:09:33,410] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 43 [default3]:[2022-03-04 04:09:33,445] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 147 [default5]:[2022-03-04 04:09:33,404] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 341 [default1]:[2022-03-04 04:09:33,385] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 25 [default1]:[2022-03-04 04:09:33,430] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 369 [default7]:[2022-03-04 04:09:33,380] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 375 [default3]:[2022-03-04 04:09:33,463] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 51 [default2]:[2022-03-04 04:09:33,479] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 178 [default4]:[2022-03-04 04:09:33,490] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 68 [default7]:[2022-03-04 04:09:33,481] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 271 [default7]:[2022-03-04 04:09:33,487] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 191 [default2]:[2022-03-04 04:09:33,553] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 226 [default1]:[2022-03-04 04:09:33,537] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 289 [default3]:[2022-03-04 04:09:33,480] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 27 [default5]:[2022-03-04 04:09:33,545] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 165 [default6]:[2022-03-04 04:09:33,481] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 342 [default6]:[2022-03-04 04:09:33,570] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 198 [default7]:[2022-03-04 04:09:33,642] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 183 [default1]:[2022-03-04 04:09:33,625] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 57 [default2]:[2022-03-04 04:09:33,593] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 58 [default5]:[2022-03-04 04:09:33,640] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 117 [default6]:[2022-03-04 04:09:33,605] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 230 [default4]:[2022-03-04 04:09:33,613] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 244 [default5]:[2022-03-04 04:09:33,634] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 229 [default7]:[2022-03-04 04:09:33,591] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 255 [default2]:[2022-03-04 04:09:33,612] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 266 
[default0]:[2022-03-04 04:09:33,564] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 0 [default0]: checkpoint version 3.0 [default7]:[2022-03-04 04:09:33,580] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 23 [default6]:[2022-03-04 04:09:33,633] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 254 [default1]:[2022-03-04 04:09:33,583] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 17 [default0]:[2022-03-04 04:09:33,610] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 32 [default4]:[2022-03-04 04:09:33,574] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 292 [default7]:[2022-03-04 04:09:33,594] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 119 [default2]:[2022-03-04 04:09:33,629] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 258 [default1]:[2022-03-04 04:09:33,649] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 41 [default6]:[2022-03-04 04:09:33,610] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 262 [default0]:[2022-03-04 04:09:33,566] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 256 [default5]:[2022-03-04 04:09:33,606] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 29 [default1]:[2022-03-04 04:09:33,669] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 89 [default5]:[2022-03-04 04:09:33,667] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 93 [default0]:[2022-03-04 04:09:33,651] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 320 [default2]:[2022-03-04 04:09:33,721] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 194 [default6]:[2022-03-04 04:09:33,737] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 70 [default4]:[2022-03-04 04:09:33,738] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 60 [default0]:[2022-03-04 04:09:33,699] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 224 [default2]:[2022-03-04 04:09:33,714] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 114 [default5]:[2022-03-04 04:09:33,653] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 181 [default6]:[2022-03-04 04:09:33,729] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 318 [default6]:[2022-03-04 04:09:33,666] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 174 [default7]:[2022-03-04 04:09:33,735] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 295 [default4]:[2022-03-04 04:09:33,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 204 [default0]:[2022-03-04 04:09:33,704] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 208 [default7]:[2022-03-04 04:09:33,720] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 319 [default2]:[2022-03-04 04:09:33,784] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 186 [default3]:[2022-03-04 04:09:33,750] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 131 
[default4]:[2022-03-04 04:09:33,795] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 140 [default1]:[2022-03-04 04:09:33,838] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 185 [default2]:[2022-03-04 04:09:33,754] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 306 [default7]:[2022-03-04 04:09:33,828] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 199 [default0]:[2022-03-04 04:09:33,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 264 [default4]:[2022-03-04 04:09:33,826] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 172 [default3]:[2022-03-04 04:09:33,767] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 347 [default6]:[2022-03-04 04:09:33,801] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 190 [default6]:[2022-03-04 04:09:33,783] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 118 [default3]:[2022-03-04 04:09:33,830] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 283 [default3]:[2022-03-04 04:09:33,851] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 267 [default1]:[2022-03-04 04:09:33,846] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 337 [default0]:[2022-03-04 04:09:33,775] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 216 [default7]:[2022-03-04 04:09:33,789] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 31 [default6]:[2022-03-04 04:09:33,863] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints 
for rank 54 [default1]:[2022-03-04 04:09:33,929] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 297 [default5]:[2022-03-04 04:09:33,865] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 197 [default4]:[2022-03-04 04:09:33,880] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 132 [default5]:[2022-03-04 04:09:33,900] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 309 [default0]:[2022-03-04 04:09:33,898] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 16 [default1]:[2022-03-04 04:09:33,860] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 49 [default2]:[2022-03-04 04:09:33,860] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 50 [default1]:[2022-03-04 04:09:33,923] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 169 [default4]:[2022-03-04 04:09:33,881] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 380 [default3]:[2022-03-04 04:09:33,893] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 339 [default5]:[2022-03-04 04:09:33,881] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 373 [default3]:[2022-03-04 04:09:34,022] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 67 [default3]:[2022-03-04 04:09:33,957] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 179 [default5]:[2022-03-04 04:09:34,027] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 61 [default2]:[2022-03-04 04:09:33,995] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 282 [default7]:[2022-03-04 04:09:33,991] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 287 [default4]:[2022-03-04 04:09:34,052] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 36 [default6]:[2022-03-04 04:09:33,998] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 214 [default0]:[2022-03-04 04:09:33,979] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 200 [default7]:[2022-03-04 04:09:34,060] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 343 [default0]:[2022-03-04 04:09:34,005] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 376 [default1]:[2022-03-04 04:09:34,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 305 [default6]:[2022-03-04 04:09:34,115] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 326 [default0]:[2022-03-04 04:09:34,054] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 96 [default3]:[2022-03-04 04:09:34,049] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 187 [default2]:[2022-03-04 04:09:34,072] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 106 [default1]:[2022-03-04 04:09:34,088] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 129 [default7]:[2022-03-04 04:09:34,098] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 71 [default5]:[2022-03-04 04:09:34,131] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 69 [default6]:[2022-03-04 04:09:34,063] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 286 [default3]:[2022-03-04 04:09:34,138] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 251 [default7]:[2022-03-04 04:09:34,151] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 231 [default5]:[2022-03-04 04:09:34,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 21 [default3]:[2022-03-04 04:09:34,117] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 203 [default6]:[2022-03-04 04:09:34,137] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 158 [default5]:[2022-03-04 04:09:34,098] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 77 [default4]:[2022-03-04 04:09:34,108] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 188 [default0]:[2022-03-04 04:09:34,139] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 288 [default7]:[2022-03-04 04:09:34,141] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 207 [default7]:[2022-03-04 04:09:34,072] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 303 [default7]:[2022-03-04 04:09:34,116] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 383 [default7]:[2022-03-04 04:09:34,185] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 55 [default7]:[2022-03-04 04:09:34,185] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 311 [default0]:[2022-03-04 04:09:34,242] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 64 
[default3]:[2022-03-04 04:09:34,165] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 307 [default0]:[2022-03-04 04:09:34,168] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 112 [default7]:[2022-03-04 04:09:34,232] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 95 [default5]:[2022-03-04 04:09:34,207] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 269 [default2]:[2022-03-04 04:09:34,236] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 18 [default1]:[2022-03-04 04:09:34,244] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 73 [default4]:[2022-03-04 04:09:34,223] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 212 [default4]:[2022-03-04 04:09:34,226] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 12 [default4]:[2022-03-04 04:09:34,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 324 [default0]:[2022-03-04 04:09:34,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 296 [default3]:[2022-03-04 04:09:34,247] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 299 [default3]:[2022-03-04 04:09:34,284] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 195 [default2]:[2022-03-04 04:09:34,317] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 66 [default3]:[2022-03-04 04:09:34,288] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 107 [default4]:[2022-03-04 04:09:34,256] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 268 [default1]:[2022-03-04 04:09:34,276] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 113 [default1]:[2022-03-04 04:09:34,273] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 225 [default6]:[2022-03-04 04:09:34,258] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 310 [default5]:[2022-03-04 04:09:34,333] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 285 [default1]:[2022-03-04 04:09:34,316] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 281 [default3]:[2022-03-04 04:09:34,266] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 291 [default6]:[2022-03-04 04:09:34,279] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 22 [default3]:[2022-03-04 04:09:34,293] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 75 [default6]:[2022-03-04 04:09:34,329] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 78 [default3]:[2022-03-04 04:09:34,292] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 379 [default5]:[2022-03-04 04:09:34,320] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 341 [default3]:[2022-03-04 04:09:34,360] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 51 [default5]:[2022-03-04 04:09:34,402] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 301 [default5]:[2022-03-04 04:09:34,348] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 133 [default1]:[2022-03-04 04:09:34,392] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 241 [default5]:[2022-03-04 04:09:34,408] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 245 [default1]:[2022-03-04 04:09:34,419] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 249 [default3]:[2022-03-04 04:09:34,393] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 227 [default2]:[2022-03-04 04:09:34,419] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 170 [default3]:[2022-03-04 04:09:34,355] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 19 [default7]:[2022-03-04 04:09:34,425] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 119 [default5]:[2022-03-04 04:09:34,386] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 293 [default1]:[2022-03-04 04:09:34,447] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 41 [default1]:[2022-03-04 04:09:34,386] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 25 [default2]:[2022-03-04 04:09:34,517] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 250 [default7]:[2022-03-04 04:09:34,466] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 79 [default0]:[2022-03-04 04:09:34,488] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 144 [default5]:[2022-03-04 04:09:34,520] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 205 [default2]:[2022-03-04 04:09:34,508] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 338 [default5]:[2022-03-04 04:09:34,507] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 29 [default6]:[2022-03-04 04:09:34,565] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 342 [default3]:[2022-03-04 04:09:34,587] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 91 [default0]:[2022-03-04 04:09:34,564] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 88 [default6]:[2022-03-04 04:09:34,623] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 110 [default6]:[2022-03-04 04:09:34,615] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 198 [default1]:[2022-03-04 04:09:34,590] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 65 [default0]:[2022-03-04 04:09:34,628] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 136 [default5]:[2022-03-04 04:09:34,571] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 117 [default7]:[2022-03-04 04:09:34,610] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 111 [default4]:[2022-03-04 04:09:34,603] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 84 [default5]:[2022-03-04 04:09:34,629] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 37 [default2]:[2022-03-04 04:09:34,567] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 34 [default5]:[2022-03-04 04:09:34,635] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 173 [default0]:[2022-03-04 04:09:34,606] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 240 [default2]:[2022-03-04 
04:09:34,604] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 226 [default1]:[2022-03-04 04:09:34,625] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 289 [default5]:[2022-03-04 04:09:34,619] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 261 [default7]:[2022-03-04 04:09:34,621] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 151 [default6]:[2022-03-04 04:09:34,656] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 150 [default7]:[2022-03-04 04:09:34,637] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 263 [default1]:[2022-03-04 04:09:34,577] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 257 [default2]:[2022-03-04 04:09:34,578] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 370 [default3]:[2022-03-04 04:09:34,567] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 371 [default5]:[2022-03-04 04:09:34,650] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 13 [default5]:[2022-03-04 04:09:34,571] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 221 [default0]:[2022-03-04 04:09:34,708] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 320 [default2]:[2022-03-04 04:09:34,715] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 194 [default2]:[2022-03-04 04:09:34,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 298 [default4]:[2022-03-04 04:09:34,701] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 
100 [default4]:[2022-03-04 04:09:34,739] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 244 [default5]:[2022-03-04 04:09:34,720] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 85 [default7]:[2022-03-04 04:09:34,673] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 23 [default6]:[2022-03-04 04:09:34,667] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 294 [default6]:[2022-03-04 04:09:34,751] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 350 [default1]:[2022-03-04 04:09:34,691] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 201 [default2]:[2022-03-04 04:09:34,679] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 258 [default0]:[2022-03-04 04:09:34,709] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 256 [default2]:[2022-03-04 04:09:34,672] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 290 [default4]:[2022-03-04 04:09:34,836] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 92 [default2]:[2022-03-04 04:09:34,776] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 186 [default6]:[2022-03-04 04:09:34,822] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 230 [default5]:[2022-03-04 04:09:34,772] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 229 [default7]:[2022-03-04 04:09:34,762] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 271 [default4]:[2022-03-04 04:09:34,798] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition 
checkpoints for rank 204 [default6]:[2022-03-04 04:09:34,828] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 190 [default3]:[2022-03-04 04:09:34,834] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 283 [default6]:[2022-03-04 04:09:34,846] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 382 [default1]:[2022-03-04 04:09:34,780] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 369 [default0]:[2022-03-04 04:09:34,899] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 80 [default6]:[2022-03-04 04:09:34,895] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 70 [default4]:[2022-03-04 04:09:34,904] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 364 [default0]:[2022-03-04 04:09:34,895] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 224 [default6]:[2022-03-04 04:09:34,899] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 270 [default6]:[2022-03-04 04:09:34,859] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 254 [default4]:[2022-03-04 04:09:34,896] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 356 [default7]:[2022-03-04 04:09:34,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 295 [default2]:[2022-03-04 04:09:34,922] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 202 [default2]:[2022-03-04 04:09:34,882] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 378 [default5]:[2022-03-04 04:09:34,926] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 381 [default7]:[2022-03-04 04:09:34,884] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 343 [default4]:[2022-03-04 04:09:34,927] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 260 [default1]:[2022-03-04 04:09:34,912] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 217 [default2]:[2022-03-04 04:09:34,962] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 10 [default1]:[2022-03-04 04:09:35,013] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 89 [default2]:[2022-03-04 04:09:35,041] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 90 [default0]:[2022-03-04 04:09:34,959] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 96 [default7]:[2022-03-04 04:09:35,041] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 71 [default3]:[2022-03-04 04:09:34,967] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 139 [default6]:[2022-03-04 04:09:35,024] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 38 [default6]:[2022-03-04 04:09:35,003] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 214 [default6]:[2022-03-04 04:09:35,020] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 206 [default3]:[2022-03-04 04:09:35,023] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 267 [default1]:[2022-03-04 04:09:34,973] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 265 [default7]:[2022-03-04 04:09:34,995] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 303 [default1]:[2022-03-04 04:09:35,016] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 377 [default5]:[2022-03-04 04:09:35,121] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 93 [default3]:[2022-03-04 04:09:35,139] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 299 [default6]:[2022-03-04 04:09:35,139] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 326 [default3]:[2022-03-04 04:09:35,091] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 67 [default5]:[2022-03-04 04:09:35,112] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 69 [default5]:[2022-03-04 04:09:35,059] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 269 [default3]:[2022-03-04 04:09:35,146] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 35 [default1]:[2022-03-04 04:09:35,125] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 345 [default1]:[2022-03-04 04:09:35,131] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 33 [default3]:[2022-03-04 04:09:35,124] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 347 [default0]:[2022-03-04 04:09:35,138] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 288 [default3]:[2022-03-04 04:09:35,156] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 211 [default5]:[2022-03-04 04:09:35,142] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 149 [default6]:[2022-03-04 
04:09:35,158] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 222 [default0]:[2022-03-04 04:09:35,103] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 200 [default5]:[2022-03-04 04:09:35,101] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 373 [default7]:[2022-03-04 04:09:35,164] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 383 [default6]:[2022-03-04 04:09:35,219] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 94 [default2]:[2022-03-04 04:09:35,208] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 322 [default1]:[2022-03-04 04:09:35,202] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 97 [default6]:[2022-03-04 04:09:35,178] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 366 [default3]:[2022-03-04 04:09:35,180] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 363 [default7]:[2022-03-04 04:09:35,246] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 231 [default7]:[2022-03-04 04:09:35,178] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 87 [default3]:[2022-03-04 04:09:35,185] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 83 [default6]:[2022-03-04 04:09:35,201] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 86 [default7]:[2022-03-04 04:09:35,159] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 39 [default0]:[2022-03-04 04:09:35,227] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for 
rank 16 [default5]:[2022-03-04 04:09:35,189] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 285 [default1]:[2022-03-04 04:09:35,199] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 17 [default1]:[2022-03-04 04:09:35,196] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 81 [default1]:[2022-03-04 04:09:35,173] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 145 [default2]:[2022-03-04 04:09:35,174] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 146 [default7]:[2022-03-04 04:09:35,162] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 207 [default0]:[2022-03-04 04:09:35,187] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 8 [default4]:[2022-03-04 04:09:35,181] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 220 [default3]:[2022-03-04 04:09:35,191] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 11 [default2]:[2022-03-04 04:09:35,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 66 [default6]:[2022-03-04 04:09:35,321] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 286 [default3]:[2022-03-04 04:09:35,277] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 227 [default6]:[2022-03-04 04:09:35,297] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 302 [default4]:[2022-03-04 04:09:35,299] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 4 [default7]:[2022-03-04 04:09:35,349] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero 
partition checkpoints for rank 351 [default7]:[2022-03-04 04:09:35,333] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 95 [default7]:[2022-03-04 04:09:35,257] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 287 [default1]:[2022-03-04 04:09:35,299] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 281 [default2]:[2022-03-04 04:09:35,304] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 346 [default2]:[2022-03-04 04:09:35,348] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 18 [default3]:[2022-03-04 04:09:35,271] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 203 [default3]:[2022-03-04 04:09:35,294] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 259 [default3]:[2022-03-04 04:09:35,355] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 379 [default7]:[2022-03-04 04:09:35,343] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 15 [default4]:[2022-03-04 04:09:35,362] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 12 [default0]:[2022-03-04 04:09:35,447] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 88 [default2]:[2022-03-04 04:09:35,393] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 82 [default6]:[2022-03-04 04:09:35,444] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 102 [default3]:[2022-03-04 04:09:35,439] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 131 [default3]:[2022-03-04 04:09:35,423] [INFO] [engine.py:2738:_get_all_zero_checkpoints] 
successfully read 8 ZeRO state_dicts for rank 243 [default5]:[2022-03-04 04:09:35,360] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 349 [default7]:[2022-03-04 04:09:35,457] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 215 [default7]:[2022-03-04 04:09:35,525] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 327 [default7]:[2022-03-04 04:09:35,494] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 103 [default0]:[2022-03-04 04:09:35,487] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 136 [default1]:[2022-03-04 04:09:35,454] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 225 [default5]:[2022-03-04 04:09:35,478] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 21 [default6]:[2022-03-04 04:09:35,518] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 294 [default6]:[2022-03-04 04:09:35,550] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 22 [default2]:[2022-03-04 04:09:35,503] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 290 [default7]:[2022-03-04 04:09:35,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 135 [default5]:[2022-03-04 04:09:35,563] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 325 [default3]:[2022-03-04 04:09:35,581] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 99 [default5]:[2022-03-04 04:09:35,592] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 101 [default5]:[2022-03-04 04:09:35,613] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 245 [default5]:[2022-03-04 04:09:35,619] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 365 [default4]:[2022-03-04 04:09:35,644] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 100 [default6]:[2022-03-04 04:09:35,629] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 246 [default5]:[2022-03-04 04:09:35,633] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 37 [default2]:[2022-03-04 04:09:35,625] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 34 [default3]:[2022-03-04 04:09:35,615] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 19 [default1]:[2022-03-04 04:09:35,589] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 209 [default6]:[2022-03-04 04:09:35,584] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 150 [default2]:[2022-03-04 04:09:35,573] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 218 [default2]:[2022-03-04 04:09:35,615] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 210 [default1]:[2022-03-04 04:09:35,728] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 241 [default1]:[2022-03-04 04:09:35,729] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 129 [default2]:[2022-03-04 04:09:35,658] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 298 [default7]:[2022-03-04 04:09:35,702] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 367 [default2]:[2022-03-04 
04:09:35,687] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 242 [default7]:[2022-03-04 04:09:35,692] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 247 [default5]:[2022-03-04 04:09:35,713] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 205 [default1]:[2022-03-04 04:09:35,748] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 257 [default5]:[2022-03-04 04:09:35,763] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 221 [default7]:[2022-03-04 04:09:35,748] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 223 [default4]:[2022-03-04 04:09:35,832] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 92 [default0]:[2022-03-04 04:09:35,770] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 360 [default6]:[2022-03-04 04:09:35,849] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 134 [default2]:[2022-03-04 04:09:35,755] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 98 [default0]:[2022-03-04 04:09:35,786] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 240 [default1]:[2022-03-04 04:09:35,855] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 9 [default6]:[2022-03-04 04:09:35,822] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 14 [default5]:[2022-03-04 04:09:35,774] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 13 [default2]:[2022-03-04 04:09:35,862] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 
130 [default1]:[2022-03-04 04:09:35,947] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 321 [default3]:[2022-03-04 04:09:35,880] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 323 [default5]:[2022-03-04 04:09:35,865] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 85 [default3]:[2022-03-04 04:09:35,889] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 139 [default1]:[2022-03-04 04:09:35,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 201 [default2]:[2022-03-04 04:09:35,900] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 202 [default5]:[2022-03-04 04:09:35,913] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 213 [default2]:[2022-03-04 04:09:35,918] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 378 [default7]:[2022-03-04 04:09:35,960] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 263 [default3]:[2022-03-04 04:09:35,936] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 219 [default1]:[2022-03-04 04:09:36,013] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 137 [default4]:[2022-03-04 04:09:36,040] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 132 [default5]:[2022-03-04 04:09:36,018] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 141 [default2]:[2022-03-04 04:09:36,024] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 138 [default2]:[2022-03-04 04:09:36,008] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero 
partition checkpoints for rank 146 [default6]:[2022-03-04 04:09:36,034] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 358 [default6]:[2022-03-04 04:09:36,044] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 222 [default2]:[2022-03-04 04:09:36,052] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 90 [default2]:[2022-03-04 04:09:36,137] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 322 [default6]:[2022-03-04 04:09:36,110] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 142 [default4]:[2022-03-04 04:09:36,081] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 364 [default6]:[2022-03-04 04:09:36,144] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 302 [default3]:[2022-03-04 04:09:36,061] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 3 [default2]:[2022-03-04 04:09:36,096] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 2 [default4]:[2022-03-04 04:09:36,117] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 84 [default1]:[2022-03-04 04:09:36,070] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 33 [default5]:[2022-03-04 04:09:36,113] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 261 [default1]:[2022-03-04 04:09:36,086] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 145 [default5]:[2022-03-04 04:09:36,077] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 149 [default6]:[2022-03-04 04:09:36,078] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 
zero partition checkpoints for rank 206 [default1]:[2022-03-04 04:09:36,085] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 217 [default7]:[2022-03-04 04:09:36,115] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 359 [default1]:[2022-03-04 04:09:36,116] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 377 [default5]:[2022-03-04 04:09:36,175] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 133 [default3]:[2022-03-04 04:09:36,186] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 355 [default1]:[2022-03-04 04:09:36,203] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 361 [default2]:[2022-03-04 04:09:36,221] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 362 [default7]:[2022-03-04 04:09:36,214] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 143 [default4]:[2022-03-04 04:09:36,227] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 356 [default1]:[2022-03-04 04:09:36,224] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 81 [default3]:[2022-03-04 04:09:36,189] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 211 [default6]:[2022-03-04 04:09:36,191] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 6 [default6]:[2022-03-04 04:09:36,261] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 94 [default1]:[2022-03-04 04:09:36,289] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 97 [default2]:[2022-03-04 04:09:36,321] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 354 [default3]:[2022-03-04 04:09:36,346] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 363 [default6]:[2022-03-04 04:09:36,316] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 382 [default4]:[2022-03-04 04:09:36,351] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 260 [default0]:[2022-03-04 04:09:36,305] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 8 [default4]:[2022-03-04 04:09:36,311] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 220 [default7]:[2022-03-04 04:09:36,301] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 7 [default2]:[2022-03-04 04:09:36,325] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 10 [default0]:[2022-03-04 04:09:36,412] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 80 [default1]:[2022-03-04 04:09:36,374] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 1 [default5]:[2022-03-04 04:09:36,380] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 381 [default6]:[2022-03-04 04:09:36,547] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 366 [default4]:[2022-03-04 04:09:36,510] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 4 [default3]:[2022-03-04 04:09:36,465] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 35 [default6]:[2022-03-04 04:09:36,532] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 38 [default5]:[2022-03-04 04:09:36,518] [INFO] 
[engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 357 [default7]:[2022-03-04 04:09:36,516] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 215 [default3]:[2022-03-04 04:09:36,543] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 11 [default3]:[2022-03-04 04:09:36,641] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 243 [default5]:[2022-03-04 04:09:36,601] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 5 [default7]:[2022-03-04 04:09:36,584] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 39 [default1]:[2022-03-04 04:09:36,575] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 209 [default2]:[2022-03-04 04:09:36,655] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 210 [default7]:[2022-03-04 04:09:36,644] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 223 [default5]:[2022-03-04 04:09:36,651] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 101 [default1]:[2022-03-04 04:09:36,725] [INFO] [engine.py:2738:_get_all_zero_checkpoints] successfully read 8 ZeRO state_dicts for rank 353 [default5]:[2022-03-04 04:09:36,740] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 365 [default3]:[2022-03-04 04:09:36,744] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 259 [default7]:[2022-03-04 04:09:36,805] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 135 [default5]:[2022-03-04 04:09:36,838] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 325 [default6]:[2022-03-04 04:09:36,842] 
[INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 102 [default7]:[2022-03-04 04:09:36,810] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 327 [default5]:[2022-03-04 04:09:36,844] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 213 [default7]:[2022-03-04 04:09:36,771] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 15 [default6]:[2022-03-04 04:09:36,866] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 14 [default7]:[2022-03-04 04:09:36,919] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 367 [default2]:[2022-03-04 04:09:36,858] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 98 [default3]:[2022-03-04 04:09:36,945] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 83 [default1]:[2022-03-04 04:09:36,943] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 9 [default2]:[2022-03-04 04:09:36,910] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 218 [default2]:[2022-03-04 04:09:36,969] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 130 [default7]:[2022-03-04 04:09:37,030] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 103 [default3]:[2022-03-04 04:09:37,043] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 99 [default2]:[2022-03-04 04:09:36,961] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 242 [default2]:[2022-03-04 04:09:37,030] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 138 [default1]:[2022-03-04 04:09:37,111] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 321 [default3]:[2022-03-04 04:09:37,105] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 323 [default7]:[2022-03-04 04:09:37,136] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 247 [default6]:[2022-03-04 04:09:37,128] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 246 [default5]:[2022-03-04 04:09:37,221] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 141 [default0]:[2022-03-04 04:09:37,158] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 360 [default6]:[2022-03-04 04:09:37,196] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 134 [default2]:[2022-03-04 04:09:37,299] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 82 [default1]:[2022-03-04 04:09:37,302] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 361 [default2]:[2022-03-04 04:09:37,323] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 362 [default7]:[2022-03-04 04:09:37,302] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 87 [default6]:[2022-03-04 04:09:37,279] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 86 [default3]:[2022-03-04 04:09:37,305] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 219 [default1]:[2022-03-04 04:09:37,358] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 137 [default3]:[2022-03-04 04:09:37,372] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 3 [default6]:[2022-03-04 04:09:37,392] [INFO] 
[engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 358 [default3]:[2022-03-04 04:09:37,536] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 355 [default6]:[2022-03-04 04:09:37,486] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 142 [default7]:[2022-03-04 04:09:37,503] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 143 [default1]:[2022-03-04 04:09:37,611] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 1 [default5]:[2022-03-04 04:09:37,572] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 357 [default6]:[2022-03-04 04:09:37,654] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 6 [default7]:[2022-03-04 04:09:37,632] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 359 [default2]:[2022-03-04 04:09:37,746] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 354 [default2]:[2022-03-04 04:09:37,714] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 2 [default1]:[2022-03-04 04:09:37,794] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 353 [default0]: successfully loaded checkpoint from /gpfsscratch/rech/six/commun/checkpoints/tr11-176B-ml/checkpoints at iteration 4704 [default0]:estimated model parameters: 191.162474496 [default0]:estimated model parameters without embeddings: 148.003086336 [default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2022-03-04 04:09:37 [default0]:> building train, validation, and test datasets ... 
[default0]: > datasets target sizes (minimum size): [default0]: train: 220000000 [default0]: validation: 2641920 [default0]: test: 20480 [default0]:> building train, validation, and test datasets for GPT ... [default0]: > building dataset index ... [default5]:[2022-03-04 04:09:37,876] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 5 [default0]:/gpfsssd/worksf/projects/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed/megatron/utils.py:280: UserWarning: Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings [default0]: warnings.warn("Parameter count with the embeddings will be inaccurate with PP > 1, as the first and last stage hold several copies of the embeddings") [default7]:[2022-03-04 04:09:37,903] [INFO] [engine.py:2668:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 7 [default7]:time (ms) | load-checkpoint: 24741.96 [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.110049 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 1211127) total of 1211127 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_19250640ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.147 seconds [default0]: total number of samples: 19333818 [default0]: total number of epochs: 41 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007279 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 2104966) total of 2104966 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_4583714ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.172 seconds [default0]: total number of samples: 4602461 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.013273 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 13965889) total of 13965889 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.044 seconds [default0]: total number of samples: 35728792 [default0]: total number of epochs: 4 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.015255 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 2626391) total of 2626391 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27456618ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.056 seconds [default0]: total number of samples: 28139393 [default0]: total number of epochs: 28 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.005037 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 746147) total of 746147 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_642209ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.029 seconds [default0]: total number of samples: 670404 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.015366 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 1659380) total of 1659380 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.134 seconds [default0]: total number of samples: 27952020 [default0]: total number of epochs: 56 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002278 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 27961608) total of 27961608 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_14576562ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.046 seconds [default0]: total number of samples: 14638800 [default0]: total number of epochs: 42 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.005770 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 36350552) total of 36350552 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_26739945ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.048 seconds [default0]: total number of samples: 27308815 [default0]: total number of epochs: 46 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.005162 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 692454) total of 692454 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_6868800ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.100 seconds [default0]: total number of samples: 6887421 [default0]: total number of epochs: 22 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002614 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 23027980) total of 23027980 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_10051887ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.053 seconds [default0]: total number of samples: 10304343 [default0]: total number of epochs: 25 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.014178 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 9098495) total of 9098495 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_28093835ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.059 seconds [default0]: total number of samples: 28924755 [default0]: total number of epochs: 10 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.005628 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 4114797) total of 4114797 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_27571073ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.044 seconds [default0]: total number of samples: 29929866 [default0]: total number of epochs: 11 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001051 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: train: [default0]: document indices in [0, 142095) total of 142095 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_train_indexmap_122580ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.013 seconds [default0]: total number of samples: 127855 [default0]: total number of epochs: 18 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870676 [default0]: dataset 1, input: 0.0207314, achieved: 0.0207314 [default0]: dataset 2, input: 0.1247, achieved: 0.1247 [default0]: dataset 3, input: 0.124182, achieved: 0.124182 [default0]: dataset 4, input: 0.0029046, achieved: 0.0029046 [default0]: dataset 5, input: 0.1247, achieved: 0.1247 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659275 [default0]: dataset 7, input: 0.120941, achieved: 0.120941 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310665 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454631 [default0]: dataset 10, input: 0.127064, achieved: 0.127064 [default0]: dataset 11, input: 0.1247, achieved: 0.1247 [default0]: dataset 12, input: 0.000554406, achieved: 0.000554405 [default0]:> elapsed time for building blendable dataset indices: 4.26 (sec) [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002351 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [1211127, 1274938) total of 63811 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_231176ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.009 seconds [default0]: total number of samples: 241146 [default0]: total number of epochs: 18 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002175 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [2104966, 2215871) total of 110905 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_55045ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.008 seconds [default0]: total number of samples: 55872 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.011774 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [13965889, 14701711) total of 735822 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.023 seconds [default0]: total number of samples: 1880535 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002515 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [2626391, 2764767) total of 138376 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_329720ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.009 seconds [default0]: total number of samples: 480297 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.009166 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [746147, 785459) total of 39312 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_7713ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.006 seconds [default0]: total number of samples: 8487 [default0]: total number of epochs: 8 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002463 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [1659380, 1746807) total of 87427 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.027 seconds [default0]: total number of samples: 907157 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.015043 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [27961608, 29434823) total of 1473215 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_175046ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.019 seconds [default0]: total number of samples: 186675 [default0]: total number of epochs: 12 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007638 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [36350552, 38265755) total of 1915203 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_321113ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.099 seconds [default0]: total number of samples: 333733 [default0]: total number of epochs: 13 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002091 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [692454, 728937) total of 36483 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_82486ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.006 seconds [default0]: total number of samples: 98264 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.009394 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [23027980, 24241256) total of 1213276 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_120711ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.020 seconds [default0]: total number of samples: 129080 [default0]: total number of epochs: 6 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007646 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [9098495, 9577868) total of 479373 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_337372ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.012 seconds [default0]: total number of samples: 469042 [default0]: total number of epochs: 3 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.006787 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [4114797, 4331593) total of 216796 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_331094ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.016 seconds [default0]: total number of samples: 398209 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.000600 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: valid: [default0]: document indices in [142095, 149581) total of 7486 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_valid_indexmap_1473ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 1544 [default0]: total number of epochs: 6 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios:
[default0]: dataset 0, input: 0.0870676, achieved: 0.0870675
[default0]: dataset 1, input: 0.0207314, achieved: 0.0207315
[default0]: dataset 2, input: 0.1247, achieved: 0.1247
[default0]: dataset 3, input: 0.124182, achieved: 0.124182
[default0]: dataset 4, input: 0.0029046, achieved: 0.00290461
[default0]: dataset 5, input: 0.1247, achieved: 0.1247
[default0]: dataset 6, input: 0.0659275, achieved: 0.0659274
[default0]: dataset 7, input: 0.120941, achieved: 0.120941
[default0]: dataset 8, input: 0.0310665, achieved: 0.0310665
[default0]: dataset 9, input: 0.0454631, achieved: 0.0454631
[default0]: dataset 10, input: 0.127064, achieved: 0.127064
[default0]: dataset 11, input: 0.1247, achieved: 0.1247
[default0]: dataset 12, input: 0.000554406, achieved: 0.000554525
[default0]:> elapsed time for building blendable dataset indices: 0.09 (sec)
[default0]: > building dataset index ...
[default0]: reading sizes...
[default0]: reading pointers...
[default0]: reading document index...
[default0]: creating numpy buffer of mmap...
[default0]: creating memory view of numpy buffer...
[default0]: > finished creating indexed dataset in 0.002387 seconds [default0]: number of documents: 1276214 [default0]: > dataset split: [default0]: test: [default0]: document indices in [1274938, 1276214) total of 1276 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ar/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1793ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.010 seconds [default0]: total number of samples: 202915 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002280 seconds [default0]: number of documents: 2218089 [default0]: > dataset split: [default0]: test: [default0]: document indices in [2215871, 2218089) total of 2218 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/ca/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_427ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 459 [default0]: total number of epochs: 13 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002061 seconds [default0]: number of documents: 14716427 [default0]: > dataset split: [default0]: test: [default0]: document indices in [14701711, 14716427) total of 14716 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/en/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 37487 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001854 seconds [default0]: number of documents: 2767535 [default0]: > dataset split: [default0]: test: [default0]: document indices in [2764767, 2767535) total of 2768 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/es/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2556ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 9926 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.006601 seconds [default0]: number of documents: 786245 [default0]: > dataset split: [default0]: test: [default0]: document indices in [785459, 786245) total of 786 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/eu/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_60ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 79 [default0]: total number of epochs: 4 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001729 seconds [default0]: number of documents: 1748556 [default0]: > dataset split: [default0]: test: [default0]: document indices in [1746807, 1748556) total of 1749 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/fr/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 34096 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001585 seconds [default0]: number of documents: 29464287 [default0]: > dataset split: [default0]: test: [default0]: document indices in [29434823, 29464287) total of 29464 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/id/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_1357ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 1645 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.007080 seconds [default0]: number of documents: 38304059 [default0]: > dataset split: [default0]: test: [default0]: document indices in [38265755, 38304059) total of 38304 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/indic/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2490ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 2778 [default0]: total number of epochs: 5 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.002180 seconds [default0]: number of documents: 729667 [default0]: > dataset split: [default0]: test: [default0]: document indices in [728937, 729667) total of 730 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/pt/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_640ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.008 seconds [default0]: total number of samples: 716 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001711 seconds [default0]: number of documents: 24265522 [default0]: > dataset split: [default0]: test: [default0]: document indices in [24241256, 24265522) total of 24266 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/vi/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_936ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.003 seconds [default0]: total number of samples: 1312 [default0]: total number of epochs: 3 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001763 seconds [default0]: number of documents: 9587455 [default0]: > dataset split: [default0]: test: [default0]: document indices in [9577868, 9587455) total of 9587 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/zh/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2616ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 3324 [default0]: total number of epochs: 2 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.001737 seconds [default0]: number of documents: 4335929 [default0]: > dataset split: [default0]: test: [default0]: document indices in [4331593, 4335929) total of 4336 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/code/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_2567ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.004 seconds [default0]: total number of samples: 3964 [default0]: total number of epochs: 1 [default0]: > building dataset index ... [default0]: reading sizes... [default0]: reading pointers... [default0]: reading document index... [default0]: creating numpy buffer of mmap... [default0]: creating memory view of numpy buffer... 
[default0]: > finished creating indexed dataset in 0.000679 seconds [default0]: number of documents: 149731 [default0]: > dataset split: [default0]: test: [default0]: document indices in [149581, 149731) total of 150 documents [default0]: > loading doc-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_doc_idx.npy [default0]: > loading sample-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_sample_idx.npy [default0]: > loading shuffle-idx mapping from /gpfsscratch/rech/six/commun/bigscience-datasets/catalogue/meg-ds-per-lang/nigercongo/bigscience-catalogue-data-dev_byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v2-dedup-lines-articles_batch_0_text_document_test_indexmap_12ns_2048sl_42s_shuffle_idx.npy [default0]: loaded indexed file in 0.002 seconds [default0]: total number of samples: 15 [default0]: total number of epochs: 2 [default0]:> building indices for blendable datasets ... 
[default0]: > sample ratios: [default0]: dataset 0, input: 0.0870676, achieved: 0.0870664 [default0]: dataset 1, input: 0.0207314, achieved: 0.020733 [default0]: dataset 2, input: 0.1247, achieved: 0.124699 [default0]: dataset 3, input: 0.124182, achieved: 0.12418 [default0]: dataset 4, input: 0.0029046, achieved: 0.0029059 [default0]: dataset 5, input: 0.1247, achieved: 0.124699 [default0]: dataset 6, input: 0.0659275, achieved: 0.0659284 [default0]: dataset 7, input: 0.120941, achieved: 0.12094 [default0]: dataset 8, input: 0.0310665, achieved: 0.0310676 [default0]: dataset 9, input: 0.0454631, achieved: 0.0454632 [default0]: dataset 10, input: 0.127064, achieved: 0.127063 [default0]: dataset 11, input: 0.1247, achieved: 0.124699 [default0]: dataset 12, input: 0.000554406, achieved: 0.000555736 [default0]:> elapsed time for building blendable dataset indices: 0.01 (sec) [default0]:> finished creating GPT datasets ... [default3]:[003-005] 177.6021B / 177.6021B [default3]:[003-006] 177.6021B / 177.6021B [default1]:[001-006] 177.6021B / 177.6021B [default2]:[002-009] 177.6021B / 177.6021B [default1]:[001-002] 177.6021B / 177.6021B [default7]:time (ms) | model-and-optimizer-setup: 32159.25 | train/valid/test-data-iterators-setup: 12990.59 [default2]:[002-010] 177.6021B / 177.6021B [default1]:[001-010] 177.6021B / 177.6021B [default3]:[003-003] 177.6021B / 177.6021B [default2]:[002-004] 177.6021B / 177.6021B [default0]:[000-002] 177.6021B / 177.6021B [default3]:[003-011] 191.1639B / 148.0045B [default2]:[002-011] 191.1639B / 148.0045B [default0]:[000-011] 191.1639B / 148.0045B [default0]:[000-007] 177.6021B / 177.6021B [default3]:[003-007] 177.6021B / 177.6021B [default1]:[001-011] 191.1639B / 148.0045B [default3]:[003-002] 177.6021B / 177.6021B [default0]:[000-010] 177.6021B / 177.6021B [default3]:[003-010] 177.6021B / 177.6021B [default1]:[001-007] 177.6021B / 177.6021B [default2]:[002-002] 177.6021B / 177.6021B [default3]:[003-009] 177.6021B / 177.6021B 
[default2]:[002-007] 177.6021B / 177.6021B [default2]:[002-003] 177.6021B / 177.6021B [default0]:[after dataloaders are built] datetime: 2022-03-04 04:09:51 [default0]:done with setup ... [default0]:training ... [default0]:Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings: [default0]:[000-000] 191.1625B / 148.0031B [default0]:[before the start of training step] datetime: 2022-03-04 04:09:51 [default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information [default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False [default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 70 total layers [default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:554:forward] ----Synchronization False [default0]:[2022-03-04 04:09:51,644] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False [default3]:[003-001] 177.6021B / 177.6021B [default1]:[001-001] 177.6021B / 177.6021B [default0]:[000-004] 177.6021B / 177.6021B [default3]:[003-004] 177.6021B / 177.6021B [default3]:[003-000] 191.1625B / 148.0031B [default1]:[001-000] 191.1625B / 148.0031B [default2]:[002-000] 191.1625B / 148.0031B [default1]:[001-005] 177.6021B / 177.6021B [default2]:[002-001] 177.6021B / 177.6021B [default0]:[000-001] 177.6021B / 177.6021B [default0]:[000-009] 177.6021B / 177.6021B [default2]:[002-006] 177.6021B / 177.6021B [default0]:[000-006] 177.6021B / 177.6021B [default2]:[002-008] 177.6021B / 177.6021B [default1]:[001-008] 177.6021B / 177.6021B [default1]:[001-003] 177.6021B / 177.6021B [default0]:[000-003] 177.6021B / 177.6021B [default0]:[000-008] 177.6021B / 177.6021B [default2]:[002-005] 177.6021B / 177.6021B [default1]:[001-004] 177.6021B / 177.6021B [default3]:[003-008] 177.6021B / 177.6021B [default1]:[001-009] 177.6021B / 177.6021B [default0]:[000-005] 177.6021B / 
177.6021B [default3]:[Rank 163] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 259] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 195] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default7]: iteration 4705/ 128728 | consumed samples: 75280 | consumed tokens: 154173440 | elapsed time per iteration (s): 40.42 | learning rate: 2.467E-05 | global batch size: 16 | lm loss: 8.390673E+00 | grad norm: 1.463 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 0.396 | TFLOPs: 3.03 | [default3]:[Rank 99] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 355] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default3]:[Rank 227] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 67] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 291] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 323] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 35] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 
[default3]:[Rank 131] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default3]:[Rank 3] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default1]:[Rank 321] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 64] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 353] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default0]:[Rank 352] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default0]:[Rank 224] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 225] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 320] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 0] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default0]:[Rank 128] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 1] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default1]:[Rank 33] 
(after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 32] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 161] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 288] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 192] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 97] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 96] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 162] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 289] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 258] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 129] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 257] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 194] (after 4705 iterations) memory 
(MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 160] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default0]:[Rank 256] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 290] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 193] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default1]:[Rank 65] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 322] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 130] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 354] (after 4705 iterations) memory (MB) | allocated: 29724.1103515625 | max allocated: 41683.2236328125 | reserved: 48348.0 | max reserved: 48348.0 [default2]:[Rank 66] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 2] (after 4705 iterations) memory (MB) | allocated: 28523.97509765625 | max allocated: 40483.08837890625 | reserved: 48348.0 | max reserved: 48348.0 [default2]:[Rank 98] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 226] (after 4705 iterations) memory (MB) | allocated: 
26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default2]:[Rank 34] (after 4705 iterations) memory (MB) | allocated: 26526.0849609375 | max allocated: 37111.35546875 | reserved: 41052.0 | max reserved: 41052.0 [default7]: iteration 4706/ 128728 | consumed samples: 75296 | consumed tokens: 154206208 | elapsed time per iteration (s): 13.98 | learning rate: 2.467E-05 | global batch size: 16 | lm loss: 5.303584E+00 | grad norm: 1.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.145 | TFLOPs: 8.76 | [default7]: iteration 4707/ 128728 | consumed samples: 75312 | consumed tokens: 154238976 | elapsed time per iteration (s): 13.74 | learning rate: 2.468E-05 | global batch size: 16 | lm loss: 5.203705E+00 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4708/ 128728 | consumed samples: 75328 | consumed tokens: 154271744 | elapsed time per iteration (s): 13.70 | learning rate: 2.468E-05 | global batch size: 16 | lm loss: 5.036973E+00 | grad norm: 0.794 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.168 | TFLOPs: 8.94 | [default7]: iteration 4709/ 128728 | consumed samples: 75344 | consumed tokens: 154304512 | elapsed time per iteration (s): 13.87 | learning rate: 2.469E-05 | global batch size: 16 | lm loss: 5.276271E+00 | grad norm: 1.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.153 | TFLOPs: 8.83 | [default7]: iteration 4710/ 128728 | consumed samples: 75360 | consumed tokens: 154337280 | elapsed time per iteration (s): 13.75 | learning rate: 2.469E-05 | global batch size: 16 | lm loss: 5.234168E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.164 | TFLOPs: 
8.91 | [default7]: iteration 4711/ 128728 | consumed samples: 75376 | consumed tokens: 154370048 | elapsed time per iteration (s): 13.73 | learning rate: 2.470E-05 | global batch size: 16 | lm loss: 5.284269E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4712/ 128728 | consumed samples: 75392 | consumed tokens: 154402816 | elapsed time per iteration (s): 13.68 | learning rate: 2.470E-05 | global batch size: 16 | lm loss: 5.290073E+00 | grad norm: 1.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.169 | TFLOPs: 8.95 | [default7]: iteration 4713/ 128728 | consumed samples: 75408 | consumed tokens: 154435584 | elapsed time per iteration (s): 13.82 | learning rate: 2.471E-05 | global batch size: 16 | lm loss: 5.294506E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4714/ 128728 | consumed samples: 75424 | consumed tokens: 154468352 | elapsed time per iteration (s): 13.68 | learning rate: 2.471E-05 | global batch size: 16 | lm loss: 5.210737E+00 | grad norm: 0.740 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.169 | TFLOPs: 8.95 | [default7]: iteration 4715/ 128728 | consumed samples: 75440 | consumed tokens: 154501120 | elapsed time per iteration (s): 13.72 | learning rate: 2.472E-05 | global batch size: 16 | lm loss: 4.925090E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.93 | [default7]: iteration 4716/ 128728 | consumed samples: 75456 | consumed tokens: 154533888 | elapsed time per iteration (s): 13.80 | learning rate: 2.473E-05 | global batch size: 16 | lm loss: 5.171408E+00 | grad norm: 0.733 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.159 | TFLOPs: 8.88 | [default7]: iteration 4717/ 128728 | consumed samples: 75472 | consumed tokens: 154566656 | elapsed time per iteration (s): 13.74 | learning rate: 2.473E-05 | global batch size: 16 | lm loss: 5.223558E+00 | grad norm: 0.678 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.164 | TFLOPs: 8.91 | [default7]: iteration 4718/ 128728 | consumed samples: 75488 | consumed tokens: 154599424 | elapsed time per iteration (s): 13.72 | learning rate: 2.474E-05 | global batch size: 16 | lm loss: 5.274587E+00 | grad norm: 1.006 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.93 | [default7]: iteration 4719/ 128728 | consumed samples: 75504 | consumed tokens: 154632192 | elapsed time per iteration (s): 13.72 | learning rate: 2.474E-05 | global batch size: 16 | lm loss: 5.199393E+00 | grad norm: 1.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.93 | [default7]: iteration 4720/ 128728 | consumed samples: 75520 | consumed tokens: 154664960 | elapsed time per iteration (s): 13.72 | learning rate: 2.475E-05 | global batch size: 16 | lm loss: 5.032928E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.167 | TFLOPs: 8.93 | [default7]: iteration 4721/ 128728 | consumed samples: 75536 | consumed tokens: 154697728 | elapsed time per iteration (s): 13.84 | learning rate: 2.475E-05 | global batch size: 16 | lm loss: 5.543484E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4722/ 128728 | consumed samples: 75552 | consumed tokens: 154730496 | elapsed time per iteration (s): 13.82 | learning 
rate: 2.476E-05 | global batch size: 16 | lm loss: 5.203832E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4723/ 128728 | consumed samples: 75568 | consumed tokens: 154763264 | elapsed time per iteration (s): 13.75 | learning rate: 2.476E-05 | global batch size: 16 | lm loss: 5.214847E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.164 | TFLOPs: 8.91 | [default7]: iteration 4724/ 128728 | consumed samples: 75584 | consumed tokens: 154796032 | elapsed time per iteration (s): 13.73 | learning rate: 2.477E-05 | global batch size: 16 | lm loss: 5.272194E+00 | grad norm: 3.048 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.92 | [default7]: iteration 4725/ 128728 | consumed samples: 75600 | consumed tokens: 154828800 | elapsed time per iteration (s): 13.66 | learning rate: 2.477E-05 | global batch size: 16 | lm loss: 5.209924E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.171 | TFLOPs: 8.97 | [default7]: iteration 4726/ 128728 | consumed samples: 75616 | consumed tokens: 154861568 | elapsed time per iteration (s): 13.64 | learning rate: 2.478E-05 | global batch size: 16 | lm loss: 5.252506E+00 | grad norm: 0.942 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.173 | TFLOPs: 8.98 | [default7]: iteration 4727/ 128728 | consumed samples: 75632 | consumed tokens: 154894336 | elapsed time per iteration (s): 13.73 | learning rate: 2.478E-05 | global batch size: 16 | lm loss: 5.076056E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.92 | [default7]: iteration 4728/ 128728 | 
consumed samples: 75648 | consumed tokens: 154927104 | elapsed time per iteration (s): 13.73 | learning rate: 2.479E-05 | global batch size: 16 | lm loss: 5.213652E+00 | grad norm: 0.807 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4729/ 128728 | consumed samples: 75664 | consumed tokens: 154959872 | elapsed time per iteration (s): 13.83 | learning rate: 2.479E-05 | global batch size: 16 | lm loss: 5.241081E+00 | grad norm: 1.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.86 | [default7]: iteration 4730/ 128728 | consumed samples: 75680 | consumed tokens: 154992640 | elapsed time per iteration (s): 13.70 | learning rate: 2.480E-05 | global batch size: 16 | lm loss: 5.206524E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.168 | TFLOPs: 8.94 | [default7]: iteration 4731/ 128728 | consumed samples: 75696 | consumed tokens: 155025408 | elapsed time per iteration (s): 13.83 | learning rate: 2.480E-05 | global batch size: 16 | lm loss: 5.311900E+00 | grad norm: 1.464 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.86 | [default7]: iteration 4732/ 128728 | consumed samples: 75712 | consumed tokens: 155058176 | elapsed time per iteration (s): 13.76 | learning rate: 2.481E-05 | global batch size: 16 | lm loss: 5.097121E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.163 | TFLOPs: 8.90 | [default7]: iteration 4733/ 128728 | consumed samples: 75728 | consumed tokens: 155090944 | elapsed time per iteration (s): 13.71 | learning rate: 2.481E-05 | global batch size: 16 | lm loss: 5.149732E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 1.167 | TFLOPs: 8.93 | [default7]: iteration 4734/ 128728 | consumed samples: 75744 | consumed tokens: 155123712 | elapsed time per iteration (s): 13.65 | learning rate: 2.482E-05 | global batch size: 16 | lm loss: 5.032346E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.172 | TFLOPs: 8.97 | [default7]: iteration 4735/ 128728 | consumed samples: 75760 | consumed tokens: 155156480 | elapsed time per iteration (s): 13.76 | learning rate: 2.483E-05 | global batch size: 16 | lm loss: 4.994672E+00 | grad norm: 0.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.163 | TFLOPs: 8.90 | [default7]: iteration 4736/ 128728 | consumed samples: 75776 | consumed tokens: 155189248 | elapsed time per iteration (s): 13.84 | learning rate: 2.483E-05 | global batch size: 16 | lm loss: 5.258005E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4737/ 128728 | consumed samples: 75792 | consumed tokens: 155222016 | elapsed time per iteration (s): 13.88 | learning rate: 2.484E-05 | global batch size: 16 | lm loss: 5.300239E+00 | grad norm: 0.819 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.153 | TFLOPs: 8.83 | [default7]: iteration 4738/ 128728 | consumed samples: 75808 | consumed tokens: 155254784 | elapsed time per iteration (s): 13.75 | learning rate: 2.484E-05 | global batch size: 16 | lm loss: 5.183598E+00 | grad norm: 0.688 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.164 | TFLOPs: 8.91 | [default7]: iteration 4739/ 128728 | consumed samples: 75824 | consumed tokens: 155287552 | elapsed time per iteration (s): 13.87 | learning rate: 2.485E-05 | global batch size: 16 | lm 
loss: 5.146806E+00 | grad norm: 1.094 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.154 | TFLOPs: 8.84 | [default7]: iteration 4740/ 128728 | consumed samples: 75840 | consumed tokens: 155320320 | elapsed time per iteration (s): 13.65 | learning rate: 2.485E-05 | global batch size: 16 | lm loss: 5.352815E+00 | grad norm: 0.843 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.172 | TFLOPs: 8.97 | [default7]: iteration 4741/ 128728 | consumed samples: 75856 | consumed tokens: 155353088 | elapsed time per iteration (s): 13.71 | learning rate: 2.486E-05 | global batch size: 16 | lm loss: 5.348001E+00 | grad norm: 0.817 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.167 | TFLOPs: 8.94 | [default7]: iteration 4742/ 128728 | consumed samples: 75872 | consumed tokens: 155385856 | elapsed time per iteration (s): 13.68 | learning rate: 2.486E-05 | global batch size: 16 | lm loss: 4.845537E+00 | grad norm: 1.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.169 | TFLOPs: 8.95 | [default7]: iteration 4743/ 128728 | consumed samples: 75888 | consumed tokens: 155418624 | elapsed time per iteration (s): 13.83 | learning rate: 2.487E-05 | global batch size: 16 | lm loss: 5.267847E+00 | grad norm: 1.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.86 | [default7]: iteration 4744/ 128728 | consumed samples: 75904 | consumed tokens: 155451392 | elapsed time per iteration (s): 13.76 | learning rate: 2.487E-05 | global batch size: 16 | lm loss: 5.161267E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.163 | TFLOPs: 8.90 | [default7]: iteration 4745/ 128728 | consumed samples: 75920 | consumed tokens: 
155484160 | elapsed time per iteration (s): 13.85 | learning rate: 2.488E-05 | global batch size: 16 | lm loss: 5.323788E+00 | grad norm: 0.939 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.155 | TFLOPs: 8.85 | [default7]: iteration 4746/ 128728 | consumed samples: 75936 | consumed tokens: 155516928 | elapsed time per iteration (s): 13.75 | learning rate: 2.488E-05 | global batch size: 16 | lm loss: 5.108951E+00 | grad norm: 0.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.164 | TFLOPs: 8.91 | [default7]: iteration 4747/ 128728 | consumed samples: 75952 | consumed tokens: 155549696 | elapsed time per iteration (s): 13.84 | learning rate: 2.489E-05 | global batch size: 16 | lm loss: 5.174131E+00 | grad norm: 1.248 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4748/ 128728 | consumed samples: 75968 | consumed tokens: 155582464 | elapsed time per iteration (s): 13.83 | learning rate: 2.489E-05 | global batch size: 16 | lm loss: 5.362530E+00 | grad norm: 0.854 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.86 | [default7]: iteration 4749/ 128728 | consumed samples: 75984 | consumed tokens: 155615232 | elapsed time per iteration (s): 13.82 | learning rate: 2.490E-05 | global batch size: 16 | lm loss: 5.456128E+00 | grad norm: 0.830 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4750/ 128728 | consumed samples: 76000 | consumed tokens: 155648000 | elapsed time per iteration (s): 13.82 | learning rate: 2.490E-05 | global batch size: 16 | lm loss: 5.163225E+00 | grad norm: 0.846 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
1.157 | TFLOPs: 8.86 | [default7]: iteration 4751/ 128728 | consumed samples: 76016 | consumed tokens: 155680768 | elapsed time per iteration (s): 13.78 | learning rate: 2.491E-05 | global batch size: 16 | lm loss: 5.049766E+00 | grad norm: 0.776 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.161 | TFLOPs: 8.89 | [default7]: iteration 4752/ 128728 | consumed samples: 76032 | consumed tokens: 155713536 | elapsed time per iteration (s): 13.81 | learning rate: 2.491E-05 | global batch size: 16 | lm loss: 5.226779E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4753/ 128728 | consumed samples: 76048 | consumed tokens: 155746304 | elapsed time per iteration (s): 13.84 | learning rate: 2.492E-05 | global batch size: 16 | lm loss: 4.977962E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4754/ 128728 | consumed samples: 76064 | consumed tokens: 155779072 | elapsed time per iteration (s): 13.73 | learning rate: 2.492E-05 | global batch size: 16 | lm loss: 5.137729E+00 | grad norm: 0.835 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.92 | [default7]: iteration 4755/ 128728 | consumed samples: 76080 | consumed tokens: 155811840 | elapsed time per iteration (s): 13.66 | learning rate: 2.493E-05 | global batch size: 16 | lm loss: 5.145767E+00 | grad norm: 0.797 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.171 | TFLOPs: 8.97 | [default7]: iteration 4756/ 128728 | consumed samples: 76096 | consumed tokens: 155844608 | elapsed time per iteration (s): 13.77 | learning rate: 2.494E-05 | global batch size: 16 | lm loss: 5.172428E+00 | grad norm: 0.748 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.162 | TFLOPs: 8.90 | [default7]: iteration 4757/ 128728 | consumed samples: 76112 | consumed tokens: 155877376 | elapsed time per iteration (s): 13.81 | learning rate: 2.494E-05 | global batch size: 16 | lm loss: 5.208878E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4758/ 128728 | consumed samples: 76128 | consumed tokens: 155910144 | elapsed time per iteration (s): 13.84 | learning rate: 2.495E-05 | global batch size: 16 | lm loss: 5.108291E+00 | grad norm: 2.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4759/ 128728 | consumed samples: 76144 | consumed tokens: 155942912 | elapsed time per iteration (s): 13.79 | learning rate: 2.495E-05 | global batch size: 16 | lm loss: 5.342599E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.160 | TFLOPs: 8.88 | [default7]: iteration 4760/ 128728 | consumed samples: 76160 | consumed tokens: 155975680 | elapsed time per iteration (s): 13.66 | learning rate: 2.496E-05 | global batch size: 16 | lm loss: 5.177962E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.172 | TFLOPs: 8.97 | [default7]: iteration 4761/ 128728 | consumed samples: 76176 | consumed tokens: 156008448 | elapsed time per iteration (s): 13.83 | learning rate: 2.496E-05 | global batch size: 16 | lm loss: 5.397847E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.86 | [default7]: iteration 4762/ 128728 | consumed samples: 76192 | consumed tokens: 156041216 | elapsed time per iteration (s): 
13.65 | learning rate: 2.497E-05 | global batch size: 16 | lm loss: 5.027542E+00 | grad norm: 0.742 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.172 | TFLOPs: 8.97 | [default7]: iteration 4763/ 128728 | consumed samples: 76208 | consumed tokens: 156073984 | elapsed time per iteration (s): 13.85 | learning rate: 2.497E-05 | global batch size: 16 | lm loss: 4.952395E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4764/ 128728 | consumed samples: 76224 | consumed tokens: 156106752 | elapsed time per iteration (s): 13.74 | learning rate: 2.498E-05 | global batch size: 16 | lm loss: 5.375393E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4765/ 128728 | consumed samples: 76240 | consumed tokens: 156139520 | elapsed time per iteration (s): 13.85 | learning rate: 2.498E-05 | global batch size: 16 | lm loss: 5.178174E+00 | grad norm: 0.767 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.155 | TFLOPs: 8.84 | [default7]: iteration 4766/ 128728 | consumed samples: 76256 | consumed tokens: 156172288 | elapsed time per iteration (s): 13.84 | learning rate: 2.499E-05 | global batch size: 16 | lm loss: 5.174879E+00 | grad norm: 1.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4767/ 128728 | consumed samples: 76272 | consumed tokens: 156205056 | elapsed time per iteration (s): 13.82 | learning rate: 2.499E-05 | global batch size: 16 | lm loss: 5.215740E+00 | grad norm: 1.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.86 | [default7]: iteration 
4768/ 128728 | consumed samples: 76288 | consumed tokens: 156237824 | elapsed time per iteration (s): 13.67 | learning rate: 2.500E-05 | global batch size: 16 | lm loss: 5.455339E+00 | grad norm: 0.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.171 | TFLOPs: 8.96 | [default7]: iteration 4769/ 128728 | consumed samples: 76304 | consumed tokens: 156270592 | elapsed time per iteration (s): 13.79 | learning rate: 2.500E-05 | global batch size: 16 | lm loss: 4.930388E+00 | grad norm: 0.790 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.160 | TFLOPs: 8.88 | [default7]: iteration 4770/ 128728 | consumed samples: 76320 | consumed tokens: 156303360 | elapsed time per iteration (s): 13.84 | learning rate: 2.501E-05 | global batch size: 16 | lm loss: 4.997752E+00 | grad norm: 1.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4771/ 128728 | consumed samples: 76336 | consumed tokens: 156336128 | elapsed time per iteration (s): 13.83 | learning rate: 2.501E-05 | global batch size: 16 | lm loss: 5.173059E+00 | grad norm: 0.934 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.85 | [default7]: iteration 4772/ 128728 | consumed samples: 76352 | consumed tokens: 156368896 | elapsed time per iteration (s): 14.24 | learning rate: 2.502E-05 | global batch size: 16 | lm loss: 5.054476E+00 | grad norm: 0.613 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.124 | TFLOPs: 8.60 | [default7]: iteration 4773/ 128728 | consumed samples: 76368 | consumed tokens: 156401664 | elapsed time per iteration (s): 13.73 | learning rate: 2.502E-05 | global batch size: 16 | lm loss: 5.099241E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4774/ 128728 | consumed samples: 76384 | consumed tokens: 156434432 | elapsed time per iteration (s): 13.79 | learning rate: 2.503E-05 | global batch size: 16 | lm loss: 5.027586E+00 | grad norm: 0.812 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.160 | TFLOPs: 8.88 | [default7]: iteration 4775/ 128728 | consumed samples: 76400 | consumed tokens: 156467200 | elapsed time per iteration (s): 13.80 | learning rate: 2.503E-05 | global batch size: 16 | lm loss: 5.055077E+00 | grad norm: 1.007 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.159 | TFLOPs: 8.88 | [default7]: iteration 4776/ 128728 | consumed samples: 76416 | consumed tokens: 156499968 | elapsed time per iteration (s): 13.86 | learning rate: 2.504E-05 | global batch size: 16 | lm loss: 4.901511E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.154 | TFLOPs: 8.84 | [default7]: iteration 4777/ 128728 | consumed samples: 76432 | consumed tokens: 156532736 | elapsed time per iteration (s): 13.96 | learning rate: 2.505E-05 | global batch size: 16 | lm loss: 5.218966E+00 | grad norm: 0.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.146 | TFLOPs: 8.78 | [default7]: iteration 4778/ 128728 | consumed samples: 76448 | consumed tokens: 156565504 | elapsed time per iteration (s): 13.89 | learning rate: 2.505E-05 | global batch size: 16 | lm loss: 5.255514E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.152 | TFLOPs: 8.82 | [default7]: iteration 4779/ 128728 | consumed samples: 76464 | consumed tokens: 156598272 | elapsed time per iteration (s): 13.82 | learning rate: 2.506E-05 | 
global batch size: 16 | lm loss: 4.949065E+00 | grad norm: 0.833 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4780/ 128728 | consumed samples: 76480 | consumed tokens: 156631040 | elapsed time per iteration (s): 13.70 | learning rate: 2.506E-05 | global batch size: 16 | lm loss: 4.956588E+00 | grad norm: 0.668 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.168 | TFLOPs: 8.94 | [default7]: iteration 4781/ 128728 | consumed samples: 76496 | consumed tokens: 156663808 | elapsed time per iteration (s): 13.71 | learning rate: 2.507E-05 | global batch size: 16 | lm loss: 5.024817E+00 | grad norm: 1.694 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.167 | TFLOPs: 8.94 | [default7]: iteration 4782/ 128728 | consumed samples: 76512 | consumed tokens: 156696576 | elapsed time per iteration (s): 13.68 | learning rate: 2.507E-05 | global batch size: 16 | lm loss: 5.319356E+00 | grad norm: 1.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.169 | TFLOPs: 8.95 | [default7]: iteration 4783/ 128728 | consumed samples: 76528 | consumed tokens: 156729344 | elapsed time per iteration (s): 13.80 | learning rate: 2.508E-05 | global batch size: 16 | lm loss: 5.366149E+00 | grad norm: 0.679 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.159 | TFLOPs: 8.87 | [default7]: iteration 4784/ 128728 | consumed samples: 76544 | consumed tokens: 156762112 | elapsed time per iteration (s): 13.68 | learning rate: 2.508E-05 | global batch size: 16 | lm loss: 5.334771E+00 | grad norm: 0.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.170 | TFLOPs: 8.96 | [default7]: iteration 4785/ 128728 | consumed samples: 
76560 | consumed tokens: 156794880 | elapsed time per iteration (s): 13.81 | learning rate: 2.509E-05 | global batch size: 16 | lm loss: 5.220145E+00 | grad norm: 0.750 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.159 | TFLOPs: 8.87 | [default7]: iteration 4786/ 128728 | consumed samples: 76576 | consumed tokens: 156827648 | elapsed time per iteration (s): 13.84 | learning rate: 2.509E-05 | global batch size: 16 | lm loss: 5.085683E+00 | grad norm: 0.759 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.156 | TFLOPs: 8.85 | [default7]: iteration 4787/ 128728 | consumed samples: 76592 | consumed tokens: 156860416 | elapsed time per iteration (s): 13.79 | learning rate: 2.510E-05 | global batch size: 16 | lm loss: 5.058179E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.160 | TFLOPs: 8.88 | [default7]: iteration 4788/ 128728 | consumed samples: 76608 | consumed tokens: 156893184 | elapsed time per iteration (s): 13.82 | learning rate: 2.510E-05 | global batch size: 16 | lm loss: 5.208087E+00 | grad norm: 0.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.158 | TFLOPs: 8.86 | [default7]: iteration 4789/ 128728 | consumed samples: 76624 | consumed tokens: 156925952 | elapsed time per iteration (s): 13.73 | learning rate: 2.511E-05 | global batch size: 16 | lm loss: 5.153974E+00 | grad norm: 0.916 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4790/ 128728 | consumed samples: 76640 | consumed tokens: 156958720 | elapsed time per iteration (s): 13.81 | learning rate: 2.511E-05 | global batch size: 16 | lm loss: 5.186059E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 1.158 | TFLOPs: 8.87 | [default7]: iteration 4791/ 128728 | consumed samples: 76656 | consumed tokens: 156991488 | elapsed time per iteration (s): 13.73 | learning rate: 2.512E-05 | global batch size: 16 | lm loss: 5.013607E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4792/ 128728 | consumed samples: 76672 | consumed tokens: 157024256 | elapsed time per iteration (s): 13.79 | learning rate: 2.512E-05 | global batch size: 16 | lm loss: 5.210199E+00 | grad norm: 0.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.160 | TFLOPs: 8.88 | [default7]: iteration 4793/ 128728 | consumed samples: 76688 | consumed tokens: 157057024 | elapsed time per iteration (s): 13.66 | learning rate: 2.513E-05 | global batch size: 16 | lm loss: 5.175740E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.171 | TFLOPs: 8.97 | [default7]: iteration 4794/ 128728 | consumed samples: 76704 | consumed tokens: 157089792 | elapsed time per iteration (s): 13.70 | learning rate: 2.513E-05 | global batch size: 16 | lm loss: 5.095262E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.168 | TFLOPs: 8.94 | [default7]: iteration 4795/ 128728 | consumed samples: 76720 | consumed tokens: 157122560 | elapsed time per iteration (s): 13.79 | learning rate: 2.514E-05 | global batch size: 16 | lm loss: 4.972818E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.160 | TFLOPs: 8.88 | [default7]: iteration 4796/ 128728 | consumed samples: 76736 | consumed tokens: 157155328 | elapsed time per iteration (s): 13.74 | learning rate: 2.514E-05 | global batch size: 16 | lm loss: 5.033150E+00 
| grad norm: 1.104 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.165 | TFLOPs: 8.92 | [default7]: iteration 4797/ 128728 | consumed samples: 76752 | consumed tokens: 157188096 | elapsed time per iteration (s): 13.87 | learning rate: 2.515E-05 | global batch size: 16 | lm loss: 5.181136E+00 | grad norm: 0.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.154 | TFLOPs: 8.83 | [default7]: iteration 4798/ 128728 | consumed samples: 76768 | consumed tokens: 157220864 | elapsed time per iteration (s): 13.75 | learning rate: 2.516E-05 | global batch size: 16 | lm loss: 5.075924E+00 | grad norm: 1.012 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.163 | TFLOPs: 8.91 | [default7]: iteration 4799/ 128728 | consumed samples: 76784 | consumed tokens: 157253632 | elapsed time per iteration (s): 13.72 | learning rate: 2.516E-05 | global batch size: 16 | lm loss: 4.798205E+00 | grad norm: 2.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.166 | TFLOPs: 8.93 | [default7]: iteration 4800/ 128728 | consumed samples: 76800 | consumed tokens: 157286400 | elapsed time per iteration (s): 13.86 | learning rate: 2.517E-05 | global batch size: 16 | lm loss: 5.076591E+00 | grad norm: 2.086 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.155 | TFLOPs: 8.84 | [default7]: iteration 4801/ 128728 | consumed samples: 76816 | consumed tokens: 157319168 | elapsed time per iteration (s): 13.85 | learning rate: 2.517E-05 | global batch size: 16 | lm loss: 5.293148E+00 | grad norm: 3.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.155 | TFLOPs: 8.84 | [default7]: iteration 4802/ 128728 | consumed samples: 76832 | consumed tokens: 157351936 | elapsed time 
per iteration (s): 13.82 | learning rate: 2.518E-05 | global batch size: 16 | lm loss: 5.133687E+00 | grad norm: 0.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.157 | TFLOPs: 8.86 | [default7]: iteration 4803/ 128728 | consumed samples: 76848 | consumed tokens: 157384704 | elapsed time per iteration (s): 13.65 | learning rate: 2.518E-05 | global batch size: 16 | lm loss: 5.139082E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.172 | TFLOPs: 8.97 | [default7]: iteration 4804/ 128728 | consumed samples: 76864 | consumed tokens: 157417472 | elapsed time per iteration (s): 13.81 | learning rate: 2.519E-05 | global batch size: 16 | lm loss: 5.191136E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.159 | TFLOPs: 8.87 | [default7]: iteration 4805/ 128728 | consumed samples: 76880 | consumed tokens: 157450240 | elapsed time per iteration (s): 14.27 | learning rate: 2.519E-05 | global batch size: 16 | lm loss: 5.444860E+00 | grad norm: 0.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.121 | TFLOPs: 8.59 | [default7]: iteration 4806/ 128728 | consumed samples: 76896 | consumed tokens: 157483008 | elapsed time per iteration (s): 13.63 | learning rate: 2.520E-05 | global batch size: 16 | lm loss: 5.277452E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.174 | TFLOPs: 8.99 | [default7]: iteration 4807/ 128728 | consumed samples: 76928 | consumed tokens: 157548544 | elapsed time per iteration (s): 14.41 | learning rate: 2.521E-05 | global batch size: 32 | lm loss: 5.110476E+00 | grad norm: 0.556 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | 
[default7]: iteration 4808/ 128728 | consumed samples: 76960 | consumed tokens: 157614080 | elapsed time per iteration (s): 14.44 | learning rate: 2.522E-05 | global batch size: 32 | lm loss: 5.159946E+00 | grad norm: 0.708 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.216 | TFLOPs: 16.97 | [default7]: iteration 4809/ 128728 | consumed samples: 76992 | consumed tokens: 157679616 | elapsed time per iteration (s): 14.37 | learning rate: 2.523E-05 | global batch size: 32 | lm loss: 5.098501E+00 | grad norm: 0.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.227 | TFLOPs: 17.05 | [default7]: iteration 4810/ 128728 | consumed samples: 77024 | consumed tokens: 157745152 | elapsed time per iteration (s): 14.41 | learning rate: 2.524E-05 | global batch size: 32 | lm loss: 5.236533E+00 | grad norm: 0.541 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.221 | TFLOPs: 17.00 | [default7]: iteration 4811/ 128728 | consumed samples: 77056 | consumed tokens: 157810688 | elapsed time per iteration (s): 14.48 | learning rate: 2.525E-05 | global batch size: 32 | lm loss: 5.184154E+00 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.210 | TFLOPs: 16.92 | [default7]: iteration 4812/ 128728 | consumed samples: 77088 | consumed tokens: 157876224 | elapsed time per iteration (s): 14.52 | learning rate: 2.526E-05 | global batch size: 32 | lm loss: 5.250757E+00 | grad norm: 0.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4813/ 128728 | consumed samples: 77120 | consumed tokens: 157941760 | elapsed time per iteration (s): 14.41 | learning rate: 2.527E-05 | global batch size: 32 | lm loss: 5.150150E+00 | grad norm: 0.669 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.221 | TFLOPs: 17.01 | [default7]: iteration 4814/ 128728 | consumed samples: 77152 | consumed tokens: 158007296 | elapsed time per iteration (s): 14.37 | learning rate: 2.528E-05 | global batch size: 32 | lm loss: 4.867079E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.226 | TFLOPs: 17.05 | [default7]: iteration 4815/ 128728 | consumed samples: 77184 | consumed tokens: 158072832 | elapsed time per iteration (s): 14.35 | learning rate: 2.529E-05 | global batch size: 32 | lm loss: 5.103984E+00 | grad norm: 0.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.231 | TFLOPs: 17.08 | [default7]: iteration 4816/ 128728 | consumed samples: 77216 | consumed tokens: 158138368 | elapsed time per iteration (s): 14.41 | learning rate: 2.530E-05 | global batch size: 32 | lm loss: 5.172581E+00 | grad norm: 0.523 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4817/ 128728 | consumed samples: 77248 | consumed tokens: 158203904 | elapsed time per iteration (s): 14.34 | learning rate: 2.531E-05 | global batch size: 32 | lm loss: 5.039461E+00 | grad norm: 0.612 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.232 | TFLOPs: 17.09 | [default7]: iteration 4818/ 128728 | consumed samples: 77280 | consumed tokens: 158269440 | elapsed time per iteration (s): 14.45 | learning rate: 2.532E-05 | global batch size: 32 | lm loss: 5.033366E+00 | grad norm: 0.591 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.214 | TFLOPs: 16.95 | [default7]: iteration 4819/ 128728 | consumed samples: 77312 | consumed tokens: 158334976 | elapsed time per iteration (s): 14.52 | 
learning rate: 2.533E-05 | global batch size: 32 | lm loss: 5.019548E+00 | grad norm: 0.507 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4820/ 128728 | consumed samples: 77344 | consumed tokens: 158400512 | elapsed time per iteration (s): 14.47 | learning rate: 2.534E-05 | global batch size: 32 | lm loss: 5.029814E+00 | grad norm: 0.528 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.212 | TFLOPs: 16.94 | [default7]: iteration 4821/ 128728 | consumed samples: 77376 | consumed tokens: 158466048 | elapsed time per iteration (s): 14.48 | learning rate: 2.535E-05 | global batch size: 32 | lm loss: 5.075526E+00 | grad norm: 0.533 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.210 | TFLOPs: 16.92 | [default7]: iteration 4822/ 128728 | consumed samples: 77408 | consumed tokens: 158531584 | elapsed time per iteration (s): 14.49 | learning rate: 2.537E-05 | global batch size: 32 | lm loss: 5.179887E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4823/ 128728 | consumed samples: 77440 | consumed tokens: 158597120 | elapsed time per iteration (s): 14.46 | learning rate: 2.538E-05 | global batch size: 32 | lm loss: 4.963607E+00 | grad norm: 0.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4824/ 128728 | consumed samples: 77472 | consumed tokens: 158662656 | elapsed time per iteration (s): 14.29 | learning rate: 2.539E-05 | global batch size: 32 | lm loss: 5.011718E+00 | grad norm: 0.528 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.239 | TFLOPs: 17.14 | [default7]: iteration 
4825/ 128728 | consumed samples: 77504 | consumed tokens: 158728192 | elapsed time per iteration (s): 14.49 | learning rate: 2.540E-05 | global batch size: 32 | lm loss: 4.995124E+00 | grad norm: 0.552 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4826/ 128728 | consumed samples: 77536 | consumed tokens: 158793728 | elapsed time per iteration (s): 14.44 | learning rate: 2.541E-05 | global batch size: 32 | lm loss: 5.081669E+00 | grad norm: 0.486 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.217 | TFLOPs: 16.97 | [default7]: iteration 4827/ 128728 | consumed samples: 77568 | consumed tokens: 158859264 | elapsed time per iteration (s): 14.53 | learning rate: 2.542E-05 | global batch size: 32 | lm loss: 5.067815E+00 | grad norm: 0.541 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4828/ 128728 | consumed samples: 77600 | consumed tokens: 158924800 | elapsed time per iteration (s): 14.45 | learning rate: 2.543E-05 | global batch size: 32 | lm loss: 4.991805E+00 | grad norm: 0.560 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.215 | TFLOPs: 16.96 | [default7]: iteration 4829/ 128728 | consumed samples: 77632 | consumed tokens: 158990336 | elapsed time per iteration (s): 14.37 | learning rate: 2.544E-05 | global batch size: 32 | lm loss: 5.154213E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.227 | TFLOPs: 17.05 | [default7]: iteration 4830/ 128728 | consumed samples: 77664 | consumed tokens: 159055872 | elapsed time per iteration (s): 14.53 | learning rate: 2.545E-05 | global batch size: 32 | lm loss: 4.978602E+00 | grad norm: 0.477 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 2.202 | TFLOPs: 16.86 | [default7]: iteration 4831/ 128728 | consumed samples: 77696 | consumed tokens: 159121408 | elapsed time per iteration (s): 14.47 | learning rate: 2.546E-05 | global batch size: 32 | lm loss: 4.966110E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.212 | TFLOPs: 16.93 | [default7]: iteration 4832/ 128728 | consumed samples: 77728 | consumed tokens: 159186944 | elapsed time per iteration (s): 14.36 | learning rate: 2.547E-05 | global batch size: 32 | lm loss: 4.906848E+00 | grad norm: 0.501 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.228 | TFLOPs: 17.06 | [default7]: iteration 4833/ 128728 | consumed samples: 77760 | consumed tokens: 159252480 | elapsed time per iteration (s): 14.40 | learning rate: 2.548E-05 | global batch size: 32 | lm loss: 4.992458E+00 | grad norm: 0.605 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.222 | TFLOPs: 17.01 | [default7]: iteration 4834/ 128728 | consumed samples: 77792 | consumed tokens: 159318016 | elapsed time per iteration (s): 14.37 | learning rate: 2.549E-05 | global batch size: 32 | lm loss: 4.984800E+00 | grad norm: 0.570 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.227 | TFLOPs: 17.05 | [default7]: iteration 4835/ 128728 | consumed samples: 77824 | consumed tokens: 159383552 | elapsed time per iteration (s): 14.50 | learning rate: 2.550E-05 | global batch size: 32 | lm loss: 5.221433E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.90 | [default7]: iteration 4836/ 128728 | consumed samples: 77856 | consumed tokens: 159449088 | elapsed time per iteration (s): 14.65 | learning rate: 
2.551E-05 | global batch size: 32 | lm loss: 4.936250E+00 | grad norm: 0.615 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.185 | TFLOPs: 16.73 | [default7]: iteration 4837/ 128728 | consumed samples: 77888 | consumed tokens: 159514624 | elapsed time per iteration (s): 14.55 | learning rate: 2.552E-05 | global batch size: 32 | lm loss: 4.874154E+00 | grad norm: 0.537 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.200 | TFLOPs: 16.84 | [default7]: iteration 4838/ 128728 | consumed samples: 77920 | consumed tokens: 159580160 | elapsed time per iteration (s): 14.53 | learning rate: 2.553E-05 | global batch size: 32 | lm loss: 5.190948E+00 | grad norm: 0.435 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4839/ 128728 | consumed samples: 77952 | consumed tokens: 159645696 | elapsed time per iteration (s): 14.42 | learning rate: 2.554E-05 | global batch size: 32 | lm loss: 5.015795E+00 | grad norm: 4.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 16.99 | [default7]: iteration 4840/ 128728 | consumed samples: 77984 | consumed tokens: 159711232 | elapsed time per iteration (s): 14.44 | learning rate: 2.555E-05 | global batch size: 32 | lm loss: 5.077456E+00 | grad norm: 0.479 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.216 | TFLOPs: 16.97 | [default7]: iteration 4841/ 128728 | consumed samples: 78016 | consumed tokens: 159776768 | elapsed time per iteration (s): 14.35 | learning rate: 2.556E-05 | global batch size: 32 | lm loss: 5.229739E+00 | grad norm: 0.519 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.229 | TFLOPs: 17.07 | [default7]: iteration 4842/ 128728 | 
consumed samples: 78048 | consumed tokens: 159842304 | elapsed time per iteration (s): 14.39 | learning rate: 2.557E-05 | global batch size: 32 | lm loss: 5.039967E+00 | grad norm: 0.505 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.224 | TFLOPs: 17.03 | [default7]: iteration 4843/ 128728 | consumed samples: 78080 | consumed tokens: 159907840 | elapsed time per iteration (s): 14.51 | learning rate: 2.559E-05 | global batch size: 32 | lm loss: 5.084831E+00 | grad norm: 0.704 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.205 | TFLOPs: 16.88 | [default7]: iteration 4844/ 128728 | consumed samples: 78112 | consumed tokens: 159973376 | elapsed time per iteration (s): 14.48 | learning rate: 2.560E-05 | global batch size: 32 | lm loss: 4.989566E+00 | grad norm: 0.514 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.210 | TFLOPs: 16.92 | [default7]: iteration 4845/ 128728 | consumed samples: 78144 | consumed tokens: 160038912 | elapsed time per iteration (s): 14.33 | learning rate: 2.561E-05 | global batch size: 32 | lm loss: 4.973344E+00 | grad norm: 0.531 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.233 | TFLOPs: 17.09 | [default7]: iteration 4846/ 128728 | consumed samples: 78176 | consumed tokens: 160104448 | elapsed time per iteration (s): 14.37 | learning rate: 2.562E-05 | global batch size: 32 | lm loss: 5.007797E+00 | grad norm: 0.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.227 | TFLOPs: 17.05 | [default7]: iteration 4847/ 128728 | consumed samples: 78208 | consumed tokens: 160169984 | elapsed time per iteration (s): 14.49 | learning rate: 2.563E-05 | global batch size: 32 | lm loss: 5.095990E+00 | grad norm: 0.499 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4848/ 128728 | consumed samples: 78240 | consumed tokens: 160235520 | elapsed time per iteration (s): 14.37 | learning rate: 2.564E-05 | global batch size: 32 | lm loss: 5.174461E+00 | grad norm: 0.525 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.227 | TFLOPs: 17.05 | [default7]: iteration 4849/ 128728 | consumed samples: 78272 | consumed tokens: 160301056 | elapsed time per iteration (s): 14.49 | learning rate: 2.565E-05 | global batch size: 32 | lm loss: 5.072275E+00 | grad norm: 0.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4850/ 128728 | consumed samples: 78304 | consumed tokens: 160366592 | elapsed time per iteration (s): 14.40 | learning rate: 2.566E-05 | global batch size: 32 | lm loss: 4.968595E+00 | grad norm: 0.489 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 4851/ 128728 | consumed samples: 78336 | consumed tokens: 160432128 | elapsed time per iteration (s): 14.42 | learning rate: 2.567E-05 | global batch size: 32 | lm loss: 5.029985E+00 | grad norm: 0.944 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4852/ 128728 | consumed samples: 78368 | consumed tokens: 160497664 | elapsed time per iteration (s): 14.38 | learning rate: 2.568E-05 | global batch size: 32 | lm loss: 4.903277E+00 | grad norm: 0.552 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.225 | TFLOPs: 17.04 | [default7]: iteration 4853/ 128728 | consumed samples: 78400 | consumed tokens: 160563200 | elapsed time per iteration (s): 14.40 | learning rate: 2.569E-05 | global 
batch size: 32 | lm loss: 5.001978E+00 | grad norm: 0.441 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.222 | TFLOPs: 17.01 | [default7]: iteration 4854/ 128728 | consumed samples: 78432 | consumed tokens: 160628736 | elapsed time per iteration (s): 14.35 | learning rate: 2.570E-05 | global batch size: 32 | lm loss: 4.934483E+00 | grad norm: 0.468 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.229 | TFLOPs: 17.07 | [default7]: iteration 4855/ 128728 | consumed samples: 78464 | consumed tokens: 160694272 | elapsed time per iteration (s): 14.33 | learning rate: 2.571E-05 | global batch size: 32 | lm loss: 4.979787E+00 | grad norm: 0.707 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.234 | TFLOPs: 17.10 | [default7]: iteration 4856/ 128728 | consumed samples: 78496 | consumed tokens: 160759808 | elapsed time per iteration (s): 14.42 | learning rate: 2.572E-05 | global batch size: 32 | lm loss: 4.876790E+00 | grad norm: 0.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4857/ 128728 | consumed samples: 78528 | consumed tokens: 160825344 | elapsed time per iteration (s): 14.50 | learning rate: 2.573E-05 | global batch size: 32 | lm loss: 5.032100E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.207 | TFLOPs: 16.90 | [default7]: iteration 4858/ 128728 | consumed samples: 78560 | consumed tokens: 160890880 | elapsed time per iteration (s): 14.39 | learning rate: 2.574E-05 | global batch size: 32 | lm loss: 4.934223E+00 | grad norm: 0.976 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.225 | TFLOPs: 17.03 | [default7]: iteration 4859/ 128728 | consumed samples: 
78592 | consumed tokens: 160956416 | elapsed time per iteration (s): 14.98 | learning rate: 2.575E-05 | global batch size: 32 | lm loss: 4.762866E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.136 | TFLOPs: 16.35 | [default7]: iteration 4860/ 128728 | consumed samples: 78624 | consumed tokens: 161021952 | elapsed time per iteration (s): 14.42 | learning rate: 2.576E-05 | global batch size: 32 | lm loss: 5.198421E+00 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4861/ 128728 | consumed samples: 78656 | consumed tokens: 161087488 | elapsed time per iteration (s): 14.40 | learning rate: 2.577E-05 | global batch size: 32 | lm loss: 4.902623E+00 | grad norm: 0.532 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 4862/ 128728 | consumed samples: 78688 | consumed tokens: 161153024 | elapsed time per iteration (s): 14.43 | learning rate: 2.578E-05 | global batch size: 32 | lm loss: 4.889926E+00 | grad norm: 1.259 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.218 | TFLOPs: 16.98 | [default7]: iteration 4863/ 128728 | consumed samples: 78720 | consumed tokens: 161218560 | elapsed time per iteration (s): 14.52 | learning rate: 2.580E-05 | global batch size: 32 | lm loss: 4.984774E+00 | grad norm: 0.465 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4864/ 128728 | consumed samples: 78752 | consumed tokens: 161284096 | elapsed time per iteration (s): 14.36 | learning rate: 2.581E-05 | global batch size: 32 | lm loss: 4.985258E+00 | grad norm: 0.614 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 2.228 | TFLOPs: 17.06 | [default7]: iteration 4865/ 128728 | consumed samples: 78784 | consumed tokens: 161349632 | elapsed time per iteration (s): 14.49 | learning rate: 2.582E-05 | global batch size: 32 | lm loss: 5.015450E+00 | grad norm: 0.651 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4866/ 128728 | consumed samples: 78816 | consumed tokens: 161415168 | elapsed time per iteration (s): 14.46 | learning rate: 2.583E-05 | global batch size: 32 | lm loss: 4.877583E+00 | grad norm: 0.511 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4867/ 128728 | consumed samples: 78848 | consumed tokens: 161480704 | elapsed time per iteration (s): 14.46 | learning rate: 2.584E-05 | global batch size: 32 | lm loss: 5.241075E+00 | grad norm: 0.617 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4868/ 128728 | consumed samples: 78880 | consumed tokens: 161546240 | elapsed time per iteration (s): 14.48 | learning rate: 2.585E-05 | global batch size: 32 | lm loss: 4.912823E+00 | grad norm: 0.481 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.92 | [default7]: iteration 4869/ 128728 | consumed samples: 78912 | consumed tokens: 161611776 | elapsed time per iteration (s): 14.97 | learning rate: 2.586E-05 | global batch size: 32 | lm loss: 4.992380E+00 | grad norm: 0.570 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.138 | TFLOPs: 16.37 | [default7]: iteration 4870/ 128728 | consumed samples: 78944 | consumed tokens: 161677312 | elapsed time per iteration (s): 14.45 | learning rate: 2.587E-05 | global batch size: 32 | 
lm loss: 5.035939E+00 | grad norm: 0.463 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.214 | TFLOPs: 16.95 | [default7]: iteration 4871/ 128728 | consumed samples: 78976 | consumed tokens: 161742848 | elapsed time per iteration (s): 14.36 | learning rate: 2.588E-05 | global batch size: 32 | lm loss: 4.827978E+00 | grad norm: 0.855 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.229 | TFLOPs: 17.07 | [default7]: iteration 4872/ 128728 | consumed samples: 79008 | consumed tokens: 161808384 | elapsed time per iteration (s): 14.41 | learning rate: 2.589E-05 | global batch size: 32 | lm loss: 4.985816E+00 | grad norm: 0.479 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4873/ 128728 | consumed samples: 79040 | consumed tokens: 161873920 | elapsed time per iteration (s): 14.40 | learning rate: 2.590E-05 | global batch size: 32 | lm loss: 4.936251E+00 | grad norm: 0.564 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 4874/ 128728 | consumed samples: 79072 | consumed tokens: 161939456 | elapsed time per iteration (s): 14.52 | learning rate: 2.591E-05 | global batch size: 32 | lm loss: 4.892041E+00 | grad norm: 0.951 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4875/ 128728 | consumed samples: 79104 | consumed tokens: 162004992 | elapsed time per iteration (s): 14.65 | learning rate: 2.592E-05 | global batch size: 32 | lm loss: 4.844186E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.184 | TFLOPs: 16.72 | [default7]: iteration 4876/ 128728 | consumed samples: 79136 | consumed 
tokens: 162070528 | elapsed time per iteration (s): 14.39 | learning rate: 2.593E-05 | global batch size: 32 | lm loss: 5.113724E+00 | grad norm: 0.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 4877/ 128728 | consumed samples: 79168 | consumed tokens: 162136064 | elapsed time per iteration (s): 14.43 | learning rate: 2.594E-05 | global batch size: 32 | lm loss: 5.039042E+00 | grad norm: 0.543 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.217 | TFLOPs: 16.97 | [default7]: iteration 4878/ 128728 | consumed samples: 79200 | consumed tokens: 162201600 | elapsed time per iteration (s): 14.45 | learning rate: 2.595E-05 | global batch size: 32 | lm loss: 5.142283E+00 | grad norm: 0.858 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.215 | TFLOPs: 16.96 | [default7]: iteration 4879/ 128728 | consumed samples: 79232 | consumed tokens: 162267136 | elapsed time per iteration (s): 14.42 | learning rate: 2.596E-05 | global batch size: 32 | lm loss: 4.902722E+00 | grad norm: 0.561 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4880/ 128728 | consumed samples: 79264 | consumed tokens: 162332672 | elapsed time per iteration (s): 14.45 | learning rate: 2.597E-05 | global batch size: 32 | lm loss: 4.755108E+00 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.215 | TFLOPs: 16.96 | [default7]: iteration 4881/ 128728 | consumed samples: 79296 | consumed tokens: 162398208 | elapsed time per iteration (s): 14.46 | learning rate: 2.598E-05 | global batch size: 32 | lm loss: 4.935410E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4882/ 128728 | consumed samples: 79328 | consumed tokens: 162463744 | elapsed time per iteration (s): 14.34 | learning rate: 2.599E-05 | global batch size: 32 | lm loss: 5.047359E+00 | grad norm: 0.530 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.231 | TFLOPs: 17.08 | [default7]: iteration 4883/ 128728 | consumed samples: 79360 | consumed tokens: 162529280 | elapsed time per iteration (s): 14.42 | learning rate: 2.600E-05 | global batch size: 32 | lm loss: 4.720992E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4884/ 128728 | consumed samples: 79392 | consumed tokens: 162594816 | elapsed time per iteration (s): 14.49 | learning rate: 2.602E-05 | global batch size: 32 | lm loss: 4.991364E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4885/ 128728 | consumed samples: 79424 | consumed tokens: 162660352 | elapsed time per iteration (s): 14.50 | learning rate: 2.603E-05 | global batch size: 32 | lm loss: 4.920027E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.207 | TFLOPs: 16.90 | [default7]: iteration 4886/ 128728 | consumed samples: 79456 | consumed tokens: 162725888 | elapsed time per iteration (s): 14.48 | learning rate: 2.604E-05 | global batch size: 32 | lm loss: 4.976588E+00 | grad norm: 0.508 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.210 | TFLOPs: 16.92 | [default7]: iteration 4887/ 128728 | consumed samples: 79488 | consumed tokens: 162791424 | elapsed time per iteration (s): 14.48 | learning rate: 2.605E-05 | global batch size: 32 | lm loss: 4.887921E+00 | 
grad norm: 0.499 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.210 | TFLOPs: 16.92 | [default7]: iteration 4888/ 128728 | consumed samples: 79520 | consumed tokens: 162856960 | elapsed time per iteration (s): 14.47 | learning rate: 2.606E-05 | global batch size: 32 | lm loss: 4.977568E+00 | grad norm: 0.608 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4889/ 128728 | consumed samples: 79552 | consumed tokens: 162922496 | elapsed time per iteration (s): 14.85 | learning rate: 2.607E-05 | global batch size: 32 | lm loss: 5.021329E+00 | grad norm: 0.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.154 | TFLOPs: 16.49 | [default7]: iteration 4890/ 128728 | consumed samples: 79584 | consumed tokens: 162988032 | elapsed time per iteration (s): 14.50 | learning rate: 2.608E-05 | global batch size: 32 | lm loss: 5.071374E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.90 | [default7]: iteration 4891/ 128728 | consumed samples: 79616 | consumed tokens: 163053568 | elapsed time per iteration (s): 14.71 | learning rate: 2.609E-05 | global batch size: 32 | lm loss: 5.032025E+00 | grad norm: 1.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.175 | TFLOPs: 16.66 | [default7]: iteration 4892/ 128728 | consumed samples: 79648 | consumed tokens: 163119104 | elapsed time per iteration (s): 14.49 | learning rate: 2.610E-05 | global batch size: 32 | lm loss: 5.014086E+00 | grad norm: 0.675 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.90 | [default7]: iteration 4893/ 128728 | consumed samples: 79680 | consumed tokens: 163184640 | elapsed 
time per iteration (s): 14.49 | learning rate: 2.611E-05 | global batch size: 32 | lm loss: 4.995523E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.90 | [default7]: iteration 4894/ 128728 | consumed samples: 79712 | consumed tokens: 163250176 | elapsed time per iteration (s): 14.42 | learning rate: 2.612E-05 | global batch size: 32 | lm loss: 4.978586E+00 | grad norm: 0.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4895/ 128728 | consumed samples: 79744 | consumed tokens: 163315712 | elapsed time per iteration (s): 14.41 | learning rate: 2.613E-05 | global batch size: 32 | lm loss: 4.924498E+00 | grad norm: 0.585 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4896/ 128728 | consumed samples: 79776 | consumed tokens: 163381248 | elapsed time per iteration (s): 14.66 | learning rate: 2.614E-05 | global batch size: 32 | lm loss: 4.915054E+00 | grad norm: 0.567 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.182 | TFLOPs: 16.71 | [default7]: iteration 4897/ 128728 | consumed samples: 79808 | consumed tokens: 163446784 | elapsed time per iteration (s): 14.46 | learning rate: 2.615E-05 | global batch size: 32 | lm loss: 5.211232E+00 | grad norm: 0.758 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4898/ 128728 | consumed samples: 79840 | consumed tokens: 163512320 | elapsed time per iteration (s): 14.91 | learning rate: 2.616E-05 | global batch size: 32 | lm loss: 4.795869E+00 | grad norm: 0.541 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.146 | TFLOPs: 
16.43 | [default7]: iteration 4899/ 128728 | consumed samples: 79872 | consumed tokens: 163577856 | elapsed time per iteration (s): 14.43 | learning rate: 2.617E-05 | global batch size: 32 | lm loss: 4.986282E+00 | grad norm: 0.561 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.218 | TFLOPs: 16.98 | [default7]: iteration 4900/ 128728 | consumed samples: 79904 | consumed tokens: 163643392 | elapsed time per iteration (s): 14.44 | learning rate: 2.618E-05 | global batch size: 32 | lm loss: 5.010670E+00 | grad norm: 0.482 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.216 | TFLOPs: 16.96 | [default7]: iteration 4901/ 128728 | consumed samples: 79936 | consumed tokens: 163708928 | elapsed time per iteration (s): 14.42 | learning rate: 2.619E-05 | global batch size: 32 | lm loss: 4.981537E+00 | grad norm: 0.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4902/ 128728 | consumed samples: 79968 | consumed tokens: 163774464 | elapsed time per iteration (s): 14.41 | learning rate: 2.620E-05 | global batch size: 32 | lm loss: 5.037811E+00 | grad norm: 0.544 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4903/ 128728 | consumed samples: 80000 | consumed tokens: 163840000 | elapsed time per iteration (s): 14.50 | learning rate: 2.621E-05 | global batch size: 32 | lm loss: 5.033264E+00 | grad norm: 0.535 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.207 | TFLOPs: 16.90 | [default7]: iteration 4904/ 128728 | consumed samples: 80032 | consumed tokens: 163905536 | elapsed time per iteration (s): 14.46 | learning rate: 2.622E-05 | global batch size: 32 | lm loss: 4.824915E+00 | grad norm: 0.552 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4905/ 128728 | consumed samples: 80064 | consumed tokens: 163971072 | elapsed time per iteration (s): 14.31 | learning rate: 2.624E-05 | global batch size: 32 | lm loss: 5.107170E+00 | grad norm: 0.627 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.237 | TFLOPs: 17.13 | [default7]: iteration 4906/ 128728 | consumed samples: 80096 | consumed tokens: 164036608 | elapsed time per iteration (s): 14.43 | learning rate: 2.625E-05 | global batch size: 32 | lm loss: 5.018471E+00 | grad norm: 0.586 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.217 | TFLOPs: 16.97 | [default7]: iteration 4907/ 128728 | consumed samples: 80128 | consumed tokens: 164102144 | elapsed time per iteration (s): 14.34 | learning rate: 2.626E-05 | global batch size: 32 | lm loss: 4.920955E+00 | grad norm: 0.483 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.232 | TFLOPs: 17.09 | [default7]: iteration 4908/ 128728 | consumed samples: 80160 | consumed tokens: 164167680 | elapsed time per iteration (s): 14.40 | learning rate: 2.627E-05 | global batch size: 32 | lm loss: 4.959438E+00 | grad norm: 0.531 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.222 | TFLOPs: 17.01 | [default7]: iteration 4909/ 128728 | consumed samples: 80192 | consumed tokens: 164233216 | elapsed time per iteration (s): 14.44 | learning rate: 2.628E-05 | global batch size: 32 | lm loss: 4.835641E+00 | grad norm: 1.725 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.216 | TFLOPs: 16.96 | [default7]: iteration 4910/ 128728 | consumed samples: 80224 | consumed tokens: 164298752 | elapsed time per iteration (s): 14.33 
| learning rate: 2.629E-05 | global batch size: 32 | lm loss: 5.024042E+00 | grad norm: 0.484 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.233 | TFLOPs: 17.10 | [default7]: iteration 4911/ 128728 | consumed samples: 80256 | consumed tokens: 164364288 | elapsed time per iteration (s): 14.50 | learning rate: 2.630E-05 | global batch size: 32 | lm loss: 4.906248E+00 | grad norm: 0.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.90 | [default7]: iteration 4912/ 128728 | consumed samples: 80288 | consumed tokens: 164429824 | elapsed time per iteration (s): 14.49 | learning rate: 2.631E-05 | global batch size: 32 | lm loss: 5.058978E+00 | grad norm: 0.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4913/ 128728 | consumed samples: 80320 | consumed tokens: 164495360 | elapsed time per iteration (s): 14.67 | learning rate: 2.632E-05 | global batch size: 32 | lm loss: 4.788593E+00 | grad norm: 0.613 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.182 | TFLOPs: 16.70 | [default7]: iteration 4914/ 128728 | consumed samples: 80352 | consumed tokens: 164560896 | elapsed time per iteration (s): 14.45 | learning rate: 2.633E-05 | global batch size: 32 | lm loss: 5.040935E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.215 | TFLOPs: 16.96 | [default7]: iteration 4915/ 128728 | consumed samples: 80384 | consumed tokens: 164626432 | elapsed time per iteration (s): 14.40 | learning rate: 2.634E-05 | global batch size: 32 | lm loss: 4.802517E+00 | grad norm: 0.756 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 
4916/ 128728 | consumed samples: 80416 | consumed tokens: 164691968 | elapsed time per iteration (s): 14.37 | learning rate: 2.635E-05 | global batch size: 32 | lm loss: 4.939359E+00 | grad norm: 0.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.226 | TFLOPs: 17.04 | [default7]: iteration 4917/ 128728 | consumed samples: 80448 | consumed tokens: 164757504 | elapsed time per iteration (s): 14.40 | learning rate: 2.636E-05 | global batch size: 32 | lm loss: 4.907125E+00 | grad norm: 0.481 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.222 | TFLOPs: 17.01 | [default7]: iteration 4918/ 128728 | consumed samples: 80480 | consumed tokens: 164823040 | elapsed time per iteration (s): 14.47 | learning rate: 2.637E-05 | global batch size: 32 | lm loss: 4.967450E+00 | grad norm: 0.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4919/ 128728 | consumed samples: 80512 | consumed tokens: 164888576 | elapsed time per iteration (s): 14.49 | learning rate: 2.638E-05 | global batch size: 32 | lm loss: 4.866137E+00 | grad norm: 0.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4920/ 128728 | consumed samples: 80544 | consumed tokens: 164954112 | elapsed time per iteration (s): 14.42 | learning rate: 2.639E-05 | global batch size: 32 | lm loss: 4.921837E+00 | grad norm: 0.535 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.219 | TFLOPs: 16.99 | [default7]: iteration 4921/ 128728 | consumed samples: 80576 | consumed tokens: 165019648 | elapsed time per iteration (s): 14.42 | learning rate: 2.640E-05 | global batch size: 32 | lm loss: 4.895585E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4922/ 128728 | consumed samples: 80608 | consumed tokens: 165085184 | elapsed time per iteration (s): 14.48 | learning rate: 2.641E-05 | global batch size: 32 | lm loss: 5.026771E+00 | grad norm: 0.449 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4923/ 128728 | consumed samples: 80640 | consumed tokens: 165150720 | elapsed time per iteration (s): 14.50 | learning rate: 2.642E-05 | global batch size: 32 | lm loss: 4.946136E+00 | grad norm: 0.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.207 | TFLOPs: 16.90 | [default7]: iteration 4924/ 128728 | consumed samples: 80672 | consumed tokens: 165216256 | elapsed time per iteration (s): 14.49 | learning rate: 2.643E-05 | global batch size: 32 | lm loss: 5.127237E+00 | grad norm: 0.516 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4925/ 128728 | consumed samples: 80704 | consumed tokens: 165281792 | elapsed time per iteration (s): 14.44 | learning rate: 2.645E-05 | global batch size: 32 | lm loss: 4.908696E+00 | grad norm: 0.496 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.216 | TFLOPs: 16.97 | [default7]: iteration 4926/ 128728 | consumed samples: 80736 | consumed tokens: 165347328 | elapsed time per iteration (s): 14.53 | learning rate: 2.646E-05 | global batch size: 32 | lm loss: 4.818577E+00 | grad norm: 0.490 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.202 | TFLOPs: 16.86 | [default7]: iteration 4927/ 128728 | consumed samples: 80768 | consumed tokens: 165412864 | elapsed time per iteration (s): 14.36 | learning rate: 
2.647E-05 | global batch size: 32 | lm loss: 5.022501E+00 | grad norm: 0.487 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.229 | TFLOPs: 17.07 | [default7]: iteration 4928/ 128728 | consumed samples: 80800 | consumed tokens: 165478400 | elapsed time per iteration (s): 14.63 | learning rate: 2.648E-05 | global batch size: 32 | lm loss: 5.095439E+00 | grad norm: 0.474 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.188 | TFLOPs: 16.75 | [default7]: iteration 4929/ 128728 | consumed samples: 80832 | consumed tokens: 165543936 | elapsed time per iteration (s): 14.56 | learning rate: 2.649E-05 | global batch size: 32 | lm loss: 4.867841E+00 | grad norm: 0.532 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.198 | TFLOPs: 16.83 | [default7]: iteration 4930/ 128728 | consumed samples: 80864 | consumed tokens: 165609472 | elapsed time per iteration (s): 14.47 | learning rate: 2.650E-05 | global batch size: 32 | lm loss: 4.895494E+00 | grad norm: 0.597 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4931/ 128728 | consumed samples: 80896 | consumed tokens: 165675008 | elapsed time per iteration (s): 14.39 | learning rate: 2.651E-05 | global batch size: 32 | lm loss: 4.990711E+00 | grad norm: 0.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.224 | TFLOPs: 17.03 | [default7]: iteration 4932/ 128728 | consumed samples: 80928 | consumed tokens: 165740544 | elapsed time per iteration (s): 14.48 | learning rate: 2.652E-05 | global batch size: 32 | lm loss: 5.046138E+00 | grad norm: 0.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4933/ 128728 | 
consumed samples: 80960 | consumed tokens: 165806080 | elapsed time per iteration (s): 14.43 | learning rate: 2.653E-05 | global batch size: 32 | lm loss: 4.994790E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.218 | TFLOPs: 16.98 | [default7]: iteration 4934/ 128728 | consumed samples: 80992 | consumed tokens: 165871616 | elapsed time per iteration (s): 14.52 | learning rate: 2.654E-05 | global batch size: 32 | lm loss: 4.952404E+00 | grad norm: 0.516 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.204 | TFLOPs: 16.87 | [default7]: iteration 4935/ 128728 | consumed samples: 81024 | consumed tokens: 165937152 | elapsed time per iteration (s): 14.44 | learning rate: 2.655E-05 | global batch size: 32 | lm loss: 5.107005E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.216 | TFLOPs: 16.96 | [default7]: iteration 4936/ 128728 | consumed samples: 81056 | consumed tokens: 166002688 | elapsed time per iteration (s): 14.76 | learning rate: 2.656E-05 | global batch size: 32 | lm loss: 4.917874E+00 | grad norm: 0.472 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.168 | TFLOPs: 16.60 | [default7]: iteration 4937/ 128728 | consumed samples: 81088 | consumed tokens: 166068224 | elapsed time per iteration (s): 14.49 | learning rate: 2.657E-05 | global batch size: 32 | lm loss: 4.967998E+00 | grad norm: 0.729 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4938/ 128728 | consumed samples: 81120 | consumed tokens: 166133760 | elapsed time per iteration (s): 14.43 | learning rate: 2.658E-05 | global batch size: 32 | lm loss: 4.925600E+00 | grad norm: 0.492 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 2.218 | TFLOPs: 16.98 | [default7]: iteration 4939/ 128728 | consumed samples: 81152 | consumed tokens: 166199296 | elapsed time per iteration (s): 14.57 | learning rate: 2.659E-05 | global batch size: 32 | lm loss: 4.884789E+00 | grad norm: 0.462 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.197 | TFLOPs: 16.82 | [default7]: iteration 4940/ 128728 | consumed samples: 81184 | consumed tokens: 166264832 | elapsed time per iteration (s): 14.37 | learning rate: 2.660E-05 | global batch size: 32 | lm loss: 4.857765E+00 | grad norm: 0.567 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.227 | TFLOPs: 17.05 | [default7]: iteration 4941/ 128728 | consumed samples: 81216 | consumed tokens: 166330368 | elapsed time per iteration (s): 14.75 | learning rate: 2.661E-05 | global batch size: 32 | lm loss: 4.846112E+00 | grad norm: 0.543 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.170 | TFLOPs: 16.61 | [default7]: iteration 4942/ 128728 | consumed samples: 81248 | consumed tokens: 166395904 | elapsed time per iteration (s): 14.46 | learning rate: 2.662E-05 | global batch size: 32 | lm loss: 5.160878E+00 | grad norm: 0.499 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.214 | TFLOPs: 16.95 | [default7]: iteration 4943/ 128728 | consumed samples: 81280 | consumed tokens: 166461440 | elapsed time per iteration (s): 14.51 | learning rate: 2.663E-05 | global batch size: 32 | lm loss: 5.023970E+00 | grad norm: 0.720 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.205 | TFLOPs: 16.88 | [default7]: iteration 4944/ 128728 | consumed samples: 81312 | consumed tokens: 166526976 | elapsed time per iteration (s): 14.48 | learning rate: 2.664E-05 | global 
batch size: 32 | lm loss: 4.885333E+00 | grad norm: 0.945 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4945/ 128728 | consumed samples: 81344 | consumed tokens: 166592512 | elapsed time per iteration (s): 14.35 | learning rate: 2.665E-05 | global batch size: 32 | lm loss: 4.871268E+00 | grad norm: 0.612 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.230 | TFLOPs: 17.08 | [default7]: iteration 4946/ 128728 | consumed samples: 81376 | consumed tokens: 166658048 | elapsed time per iteration (s): 14.47 | learning rate: 2.667E-05 | global batch size: 32 | lm loss: 5.091561E+00 | grad norm: 0.596 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4947/ 128728 | consumed samples: 81408 | consumed tokens: 166723584 | elapsed time per iteration (s): 14.41 | learning rate: 2.668E-05 | global batch size: 32 | lm loss: 5.002282E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.220 | TFLOPs: 17.00 | [default7]: iteration 4948/ 128728 | consumed samples: 81440 | consumed tokens: 166789120 | elapsed time per iteration (s): 14.34 | learning rate: 2.669E-05 | global batch size: 32 | lm loss: 4.788313E+00 | grad norm: 0.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.232 | TFLOPs: 17.09 | [default7]: iteration 4949/ 128728 | consumed samples: 81472 | consumed tokens: 166854656 | elapsed time per iteration (s): 14.49 | learning rate: 2.670E-05 | global batch size: 32 | lm loss: 4.989025E+00 | grad norm: 0.515 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4950/ 128728 | consumed samples: 
81504 | consumed tokens: 166920192 | elapsed time per iteration (s): 14.41 | learning rate: 2.671E-05 | global batch size: 32 | lm loss: 4.957456E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.221 | TFLOPs: 17.00 | [default7]: iteration 4951/ 128728 | consumed samples: 81536 | consumed tokens: 166985728 | elapsed time per iteration (s): 14.53 | learning rate: 2.672E-05 | global batch size: 32 | lm loss: 4.846943E+00 | grad norm: 0.478 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4952/ 128728 | consumed samples: 81568 | consumed tokens: 167051264 | elapsed time per iteration (s): 14.39 | learning rate: 2.673E-05 | global batch size: 32 | lm loss: 5.056949E+00 | grad norm: 0.601 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.224 | TFLOPs: 17.03 | [default7]: iteration 4953/ 128728 | consumed samples: 81600 | consumed tokens: 167116800 | elapsed time per iteration (s): 14.53 | learning rate: 2.674E-05 | global batch size: 32 | lm loss: 5.134397E+00 | grad norm: 0.462 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4954/ 128728 | consumed samples: 81632 | consumed tokens: 167182336 | elapsed time per iteration (s): 14.46 | learning rate: 2.675E-05 | global batch size: 32 | lm loss: 5.089641E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.214 | TFLOPs: 16.95 | [default7]: iteration 4955/ 128728 | consumed samples: 81664 | consumed tokens: 167247872 | elapsed time per iteration (s): 14.40 | learning rate: 2.676E-05 | global batch size: 32 | lm loss: 4.803981E+00 | grad norm: 0.487 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 4956/ 128728 | consumed samples: 81696 | consumed tokens: 167313408 | elapsed time per iteration (s): 14.46 | learning rate: 2.677E-05 | global batch size: 32 | lm loss: 4.882299E+00 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.214 | TFLOPs: 16.95 | [default7]: iteration 4957/ 128728 | consumed samples: 81728 | consumed tokens: 167378944 | elapsed time per iteration (s): 14.52 | learning rate: 2.678E-05 | global batch size: 32 | lm loss: 4.925550E+00 | grad norm: 0.562 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.203 | TFLOPs: 16.87 | [default7]: iteration 4958/ 128728 | consumed samples: 81760 | consumed tokens: 167444480 | elapsed time per iteration (s): 14.47 | learning rate: 2.679E-05 | global batch size: 32 | lm loss: 4.976944E+00 | grad norm: 0.503 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4959/ 128728 | consumed samples: 81792 | consumed tokens: 167510016 | elapsed time per iteration (s): 14.44 | learning rate: 2.680E-05 | global batch size: 32 | lm loss: 4.880012E+00 | grad norm: 0.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.217 | TFLOPs: 16.97 | [default7]: iteration 4960/ 128728 | consumed samples: 81824 | consumed tokens: 167575552 | elapsed time per iteration (s): 14.31 | learning rate: 2.681E-05 | global batch size: 32 | lm loss: 4.842023E+00 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.237 | TFLOPs: 17.13 | [default7]: iteration 4961/ 128728 | consumed samples: 81856 | consumed tokens: 167641088 | elapsed time per iteration (s): 14.45 | learning rate: 2.682E-05 | global batch size: 32 | 
lm loss: 4.906799E+00 | grad norm: 1.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.215 | TFLOPs: 16.96 | [default7]: iteration 4962/ 128728 | consumed samples: 81888 | consumed tokens: 167706624 | elapsed time per iteration (s): 14.37 | learning rate: 2.683E-05 | global batch size: 32 | lm loss: 5.035071E+00 | grad norm: 1.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.226 | TFLOPs: 17.04 | [default7]: iteration 4963/ 128728 | consumed samples: 81920 | consumed tokens: 167772160 | elapsed time per iteration (s): 14.49 | learning rate: 2.684E-05 | global batch size: 32 | lm loss: 4.912130E+00 | grad norm: 0.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4964/ 128728 | consumed samples: 81952 | consumed tokens: 167837696 | elapsed time per iteration (s): 14.78 | learning rate: 2.685E-05 | global batch size: 32 | lm loss: 4.826226E+00 | grad norm: 0.514 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.165 | TFLOPs: 16.58 | [default7]: iteration 4965/ 128728 | consumed samples: 81984 | consumed tokens: 167903232 | elapsed time per iteration (s): 14.65 | learning rate: 2.686E-05 | global batch size: 32 | lm loss: 4.870893E+00 | grad norm: 0.498 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.184 | TFLOPs: 16.72 | [default7]: iteration 4966/ 128728 | consumed samples: 82016 | consumed tokens: 167968768 | elapsed time per iteration (s): 14.47 | learning rate: 2.688E-05 | global batch size: 32 | lm loss: 4.855809E+00 | grad norm: 0.661 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.212 | TFLOPs: 16.93 | [default7]: iteration 4967/ 128728 | consumed samples: 82048 | consumed 
tokens: 168034304 | elapsed time per iteration (s): 14.40 | learning rate: 2.689E-05 | global batch size: 32 | lm loss: 5.050081E+00 | grad norm: 0.540 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.223 | TFLOPs: 17.02 | [default7]: iteration 4968/ 128728 | consumed samples: 82080 | consumed tokens: 168099840 | elapsed time per iteration (s): 14.46 | learning rate: 2.690E-05 | global batch size: 32 | lm loss: 4.922202E+00 | grad norm: 0.543 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.213 | TFLOPs: 16.95 | [default7]: iteration 4969/ 128728 | consumed samples: 82112 | consumed tokens: 168165376 | elapsed time per iteration (s): 14.40 | learning rate: 2.691E-05 | global batch size: 32 | lm loss: 4.749779E+00 | grad norm: 0.614 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.222 | TFLOPs: 17.01 | [default7]: iteration 4970/ 128728 | consumed samples: 82144 | consumed tokens: 168230912 | elapsed time per iteration (s): 14.47 | learning rate: 2.692E-05 | global batch size: 32 | lm loss: 4.917465E+00 | grad norm: 0.824 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.211 | TFLOPs: 16.93 | [default7]: iteration 4971/ 128728 | consumed samples: 82176 | consumed tokens: 168296448 | elapsed time per iteration (s): 14.66 | learning rate: 2.693E-05 | global batch size: 32 | lm loss: 4.817117E+00 | grad norm: 0.604 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.183 | TFLOPs: 16.72 | [default7]: iteration 4972/ 128728 | consumed samples: 82208 | consumed tokens: 168361984 | elapsed time per iteration (s): 14.46 | learning rate: 2.694E-05 | global batch size: 32 | lm loss: 4.984338E+00 | grad norm: 0.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 2.213 | TFLOPs: 16.94 | [default7]: iteration 4973/ 128728 | consumed samples: 82240 | consumed tokens: 168427520 | elapsed time per iteration (s): 14.69 | learning rate: 2.695E-05 | global batch size: 32 | lm loss: 4.920941E+00 | grad norm: 0.513 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.178 | TFLOPs: 16.67 | [default7]: iteration 4974/ 128728 | consumed samples: 82272 | consumed tokens: 168493056 | elapsed time per iteration (s): 14.38 | learning rate: 2.696E-05 | global batch size: 32 | lm loss: 4.996977E+00 | grad norm: 0.565 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.225 | TFLOPs: 17.04 | [default7]: iteration 4975/ 128728 | consumed samples: 82304 | consumed tokens: 168558592 | elapsed time per iteration (s): 14.49 | learning rate: 2.697E-05 | global batch size: 32 | lm loss: 4.976236E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.209 | TFLOPs: 16.91 | [default7]: iteration 4976/ 128728 | consumed samples: 82336 | consumed tokens: 168624128 | elapsed time per iteration (s): 14.49 | learning rate: 2.698E-05 | global batch size: 32 | lm loss: 4.962553E+00 | grad norm: 0.533 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4977/ 128728 | consumed samples: 82368 | consumed tokens: 168689664 | elapsed time per iteration (s): 14.66 | learning rate: 2.699E-05 | global batch size: 32 | lm loss: 4.922136E+00 | grad norm: 0.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.182 | TFLOPs: 16.71 | [default7]: iteration 4978/ 128728 | consumed samples: 82400 | consumed tokens: 168755200 | elapsed time per iteration (s): 14.49 | learning rate: 2.700E-05 | global batch size: 32 | lm loss: 4.775953E+00 | 
grad norm: 0.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.208 | TFLOPs: 16.91 | [default7]: iteration 4979/ 128728 | consumed samples: 82432 | consumed tokens: 168820736 | elapsed time per iteration (s): 14.36 | learning rate: 2.701E-05 | global batch size: 32 | lm loss: 4.841665E+00 | grad norm: 0.487 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.228 | TFLOPs: 17.06 | [default7]: iteration 4980/ 128728 | consumed samples: 82464 | consumed tokens: 168886272 | elapsed time per iteration (s): 14.41 | learning rate: 2.702E-05 | global batch size: 32 | lm loss: 4.885078E+00 | grad norm: 0.477 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.221 | TFLOPs: 17.01 | [default7]: iteration 4981/ 128728 | consumed samples: 82496 | consumed tokens: 168951808 | elapsed time per iteration (s): 14.67 | learning rate: 2.703E-05 | global batch size: 32 | lm loss: 4.872721E+00 | grad norm: 0.550 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.181 | TFLOPs: 16.70 | [default7]: iteration 4982/ 128728 | consumed samples: 82528 | consumed tokens: 169017344 | elapsed time per iteration (s): 14.43 | learning rate: 2.704E-05 | global batch size: 32 | lm loss: 4.986514E+00 | grad norm: 0.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.218 | TFLOPs: 16.98 | [default7]: iteration 4983/ 128728 | consumed samples: 82560 | consumed tokens: 169082880 | elapsed time per iteration (s): 14.35 | learning rate: 2.705E-05 | global batch size: 32 | lm loss: 4.904243E+00 | grad norm: 0.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.230 | TFLOPs: 17.07 |